We consider the fundamental problem of estimating the mean of
a vector $y = X\beta + z$, where $X$ is an $n \times p$ design matrix in
which one can have far more variables than observations, and $z$ is a
stochastic error term (the so-called "$p > n$" setup). When $\beta$ is
sparse, or, more generally, when there is a sparse subset of covariates
providing a close approximation to the unknown mean vector, we ask
whether or not it is possible to accurately estimate $X\beta$ using a
computationally tractable algorithm.

We show that, in a surprisingly wide range of situations, the lasso
happens to nearly select the best subset of variables. Quantitatively
speaking, we prove that solving a simple quadratic program achieves
a squared error within a logarithmic factor of the ideal mean squared
error that one would achieve with an oracle supplying perfect
information about which variables should and should not be included in
the model. Interestingly, our results describe the average performance
of the lasso; that is, the performance one can expect in a vast
majority of cases where $X\beta$ is a sparse or nearly sparse
superposition of variables, but not in all cases.

Our results are nonasymptotic and widely applicable, since they
simply require that pairs of predictor variables are not too collinear.

1. Introduction. One of the most common problems in statistics is to
estimate a mean response $X\beta$ from the data $y = (y_1, y_2, \ldots, y_n)$
and the linear model

(1.1)    $y = X\beta + z$,

where $X$ is an $n \times p$ matrix of explanatory variables, $\beta$ is a
$p$-dimensional parameter of interest and $z = (z_1, \ldots, z_n)$ is a
vector of independent stochastic errors. Unless specified otherwise, we
will assume that the errors are Gaussian with $z_i \sim N(0, \sigma^2)$,
but this is not really essential, as our results and methods can easily
accommodate other types of distribution. We measure the performance of
any estimator $X\hat\beta$ with the usual squared Euclidean distance
$\|X\beta - X\hat\beta\|_{\ell_2}^2$, or with the mean-squared error, which
is simply the expected value of this quantity.

In this paper, although this is not a restriction, we are primarily interested
in situations where there are as many explanatory variables as observations,
or more: the so-called, and now widely popular, "$p > n$" setup. In such
circumstances, however, it is often the case that a relatively small number
of variables have substantial explanatory power, so that, to achieve accurate
estimation, one needs to select the "right" variables and determine which
components $\beta_i$ are not equal to zero. A standard approach is to find
$\hat\beta$ by solving

(1.2)    $\min_{b \in \mathbb{R}^p} \frac{1}{2}\|y - Xb\|_{\ell_2}^2 + \lambda_0 \sigma^2 \|b\|_{\ell_0}$,

where $\|b\|_{\ell_0}$ is the number of nonzero components in $b$. In other
words, the estimator (1.2) achieves the best trade-off between the goodness
of fit and the complexity (in this case, the number of variables included)
of the model. Popular selection procedures such as AIC, $C_p$, BIC and RIC
are all of this form, with different values of the parameter ($\lambda_0 = 1$
in AIC [1, 19], $\lambda_0 = \frac{1}{2}\log n$ in BIC [24], and
$\lambda_0 = \log p$ in RIC [14]). It is known that these methods perform
well both empirically and theoretically (see [14] and [2, 4] and the many
references therein). Having said this, the problem, of course, is that these
"canonical selection procedures" are highly impractical. Solving (1.2) is,
in general, NP-hard [22] and, to the best of our knowledge, requires
exhaustive searches over all subsets of columns of $X$, a procedure which is
clearly combinatorial in nature and has exponential complexity, since, for
$p$ of size about $n$, there are about $2^p$ such subsets.
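To make the combinatorial cost concrete, the following is a minimal sketch of the exhaustive search behind (1.2) on a toy problem. The function name `best_subset`, the problem sizes and the random data are illustrative assumptions, not part of the procedures cited above; the loop visits every one of the $2^p$ subsets, which is only feasible for very small $p$.

```python
import itertools
import numpy as np

def best_subset(X, y, lam0, sigma):
    """Exhaustively minimize 0.5*||y - X b||^2 + lam0 * sigma^2 * ||b||_0.

    Hypothetical helper for illustration: it fits least squares on every
    subset of columns and keeps the subset with the smallest penalized cost.
    """
    n, p = X.shape
    best_cost, best_b = 0.5 * float(y @ y), np.zeros(p)  # empty model
    n_subsets = 1                                        # counts the empty model
    for k in range(1, p + 1):
        for subset in itertools.combinations(range(p), k):
            n_subsets += 1
            cols = list(subset)
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            resid = y - X[:, cols] @ coef
            cost = 0.5 * float(resid @ resid) + lam0 * sigma**2 * k
            if cost < best_cost:
                b = np.zeros(p)
                b[cols] = coef
                best_cost, best_b = cost, b
    return best_b, best_cost, n_subsets

# Toy data (assumed sizes): p = 10 already requires 2**10 = 1024 fits.
rng = np.random.default_rng(0)
n, p, sigma = 8, 10, 0.1
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)          # unit-normed columns
beta = np.zeros(p)
beta[[1, 6]] = [3.0, -2.0]
y = X @ beta + sigma * rng.standard_normal(n)

b_hat, cost, n_subsets = best_subset(X, y, lam0=np.log(p), sigma=sigma)
print("subsets visited:", n_subsets, "selected columns:", np.flatnonzero(b_hat))
```

Each additional variable doubles the number of least-squares fits, which is why this search is not a practical option at the problem sizes considered in this paper.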

In recent years, several methods based on $\ell_1$ minimization have been
proposed to overcome this problem. The most well-known is probably the
lasso [26], which replaces the nonconvex $\ell_0$ norm in (1.2) with the
convex $\ell_1$ norm $\|b\|_{\ell_1} = \sum_{i=1}^p |b_i|$. The lasso
estimate $\hat\beta$ is defined as the solution to

(1.3)    $\min_{b \in \mathbb{R}^p} \frac{1}{2}\|y - Xb\|_{\ell_2}^2 + \lambda \sigma \|b\|_{\ell_1}$,

where $\lambda$ is a regularization parameter that essentially controls the
sparsity (or the complexity) of the estimated coefficients (see [23] and [11]
for exactly the same proposal). In contrast to (1.2), the optimization
problem (1.3) is a quadratic program that can be solved efficiently. It is
known that the lasso performs well in some circumstances. Further, there is
also an emerging literature on its theoretical properties
[3, 5, 6, 15, 16, 20, 21, 28, 29, 30] showing that, in some special cases,
the lasso is effective.

In this paper, we will show that the lasso works provably well in a
surprisingly broad range of situations. We establish that, under minimal
assumptions guaranteeing that the predictor variables are not highly
correlated, the lasso achieves a squared error nearly as good as if one had
an oracle supplying perfect information about which $\beta_i$'s were nonzero.
Continuing in this direction, we also establish that the lasso correctly
identifies the true model with very large probability, provided that the
amplitudes of the nonzero $\beta_i$ are sufficiently large.

1.1. The coherence property. Throughout the paper, we will assume that,
without loss of generality, the matrix $X$ has unit-normed columns, as one
can otherwise always rescale the columns. We denote, by $X_i$, the $i$th
column of $X$ ($\|X_i\|_{\ell_2} = 1$) and introduce the notion of coherence,
which essentially measures the maximum correlation between unit-normed
predictor variables and is defined by

(1.4)    $\mu(X) = \sup_{1 \le i < j \le p} |\langle X_i, X_j \rangle|$.

In other words, the coherence is the maximum inner product, in absolute
value, between any two distinct columns of $X$. It follows that, if the
columns have zero mean, the coherence is just the maximum correlation
between pairs of predictor variables.

We will be interested in problems in which the variables are not highly 
collinear or redundant. 

Definition 1.1 (Coherence property). A matrix $X$ is said to obey the
coherence property if

(1.5)    $\mu(X) \le A_0 \cdot (\log p)^{-1}$,

where $A_0$ is some positive numerical constant.

A matrix obeying the coherence property is a matrix in which the
predictors are not highly collinear. This is a mild assumption. Suppose that
$X$ is a Gaussian matrix with i.i.d. entries whose columns are subsequently
normalized. The coherence of $X$ is about $\sqrt{(2\log p)/n}$, so that such
matrices trivially obey the coherence property, unless $n$ is ridiculously
small [i.e., of the order of $(\log p)^3$]. We will give other examples of
matrices obeying this property later in the paper, and we will soon contrast
this assumption with what is traditionally assumed in the literature.
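As a quick numerical illustration of the coherence (1.4), the sketch below computes $\mu(X)$ for a column-normalized Gaussian matrix and prints it next to the heuristic order $\sqrt{(2\log p)/n}$ and to $1/\log p$. The helper name `coherence`, the sizes and the seed are assumptions made for this example; the comparison is only meant to show orders of magnitude, not the constant $A_0$.

```python
import numpy as np

def coherence(X):
    """Largest |<X_i, X_j>| over distinct columns after unit-normalizing them."""
    Xn = X / np.linalg.norm(X, axis=0)
    G = np.abs(Xn.T @ Xn)
    np.fill_diagonal(G, 0.0)            # ignore <X_i, X_i> = 1
    return float(G.max())

# Assumed toy sizes for a Gaussian design with i.i.d. entries.
rng = np.random.default_rng(0)
n, p = 400, 1000
X = rng.standard_normal((n, p))

mu = coherence(X)
heuristic = np.sqrt(2 * np.log(p) / n)
print(f"mu(X) = {mu:.3f}")
print(f"sqrt(2 log p / n) = {heuristic:.3f}")
print(f"1 / log p = {1 / np.log(p):.3f}")
```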

1.2. Sparse model selection. We begin by discussing the intuitive case,
where the vector $\beta$ is sparse, before extending our results to a
completely general case. The basic question we would like to address here
is, how well can one estimate the response $X\beta$, when $\beta$ happens to
have only $S$ nonzero components? From now on, we call such vectors
$S$-sparse.

First and foremost, we would like to emphasize that, in this paper, we are
interested in quantifying the performance one can expect from the lasso in
an overwhelming majority of cases. This viewpoint needs to be contrasted
with an analysis concentrating on the worst case performance; when the
focus is on the worst case scenario, one would study very particular values
of the parameter $\beta$ for which the lasso does not work well. This is not
our objective; as an aside, this will enable us to show that one can reliably
estimate the mean response $X\beta$ under much weaker conditions than what
is currently known.

Our point of view emphasizes the average performance (or the perfor- 
mance one could expect in a large majority of cases); thus, we need a sta- 
tistical description of sparse models. To this end, we introduce the generic 
S-sparse model, which is defined as follows: 

1. The support $I \subset \{1, \ldots, p\}$ of the $S$ nonzero coefficients
of $\beta$ is selected uniformly at random.

2. Conditional on $I$, the signs of the nonzero entries of $\beta$ are
independent and equally likely to be $-1$ or $1$.

We make no assumption on the amplitudes. In some sense, this is the sim- 
plest statistical model one could think of; it says, simply, that all subsets of 
a given cardinality are equally likely, and that the signs of the coefficients 
are equally likely. In other words, one is not biased towards certain vari- 
ables, nor do we have any reason to believe a priori that a given coefficient 
is positive or negative. 
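The generic $S$-sparse model described above can be simulated in a few lines. This is a minimal sketch: the function name `sample_generic_sparse` is hypothetical, and since the model makes no assumption on the amplitudes, they are simply passed in by the caller.

```python
import numpy as np

def sample_generic_sparse(p, S, amplitudes, rng):
    """Draw beta from the generic S-sparse model.

    The support is a uniformly random subset of size S, and the signs are
    independent and equally likely to be -1 or 1. The (nonzero) amplitudes
    are arbitrary and supplied by the caller.
    """
    support = rng.choice(p, size=S, replace=False)
    signs = rng.choice([-1.0, 1.0], size=S)
    beta = np.zeros(p)
    beta[support] = signs * np.abs(np.asarray(amplitudes, dtype=float))
    return beta, np.sort(support)

rng = np.random.default_rng(0)
beta, support = sample_generic_sparse(p=50, S=4, amplitudes=[1.0, 2.0, 3.0, 4.0], rng=rng)
print("support:", support)
print("nonzero entries:", beta[support])
```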

Our first result is that for most $S$-sparse vectors $\beta$, the lasso is
provably accurate. Throughout, $\|X\|$ refers to the operator norm of the
matrix $X$ (the largest singular value).

Theorem 1.2. Suppose that $X$ obeys the coherence property, and assume
that $\beta$ is taken from the generic $S$-sparse model. Suppose that
$S \le c_0 p/[\|X\|^2 \log p]$ for some positive numerical constant $c_0$.
Then, the lasso estimate (1.3) computed with $\lambda = 2\sqrt{2\log p}$
obeys

(1.6)    $\|X\beta - X\hat\beta\|_{\ell_2}^2 \le C_0 \cdot (2\log p) \cdot S \cdot \sigma^2$

with probability at least $1 - 6p^{-2\log 2} - p^{-1}(2\pi\log p)^{-1/2}$.
The constant $C_0$ may be taken as $8(1 + \sqrt{2})^2$.

For simplicity, we have chosen $\lambda = 2\sqrt{2\log p}$, but one could
take any $\lambda$ of the form $\lambda = (1 + a)\sqrt{2\log p}$ with
$a > 0$. Our proof indicates that, as $a$ decreases, the probability with
which (1.6) holds decreases, but the constant $C_0$ also decreases.
Conversely, as $a$ increases, the probability with which (1.6) holds
increases, but the constant $C_0$ also increases.
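For readers who want to experiment with the estimate (1.3) at $\lambda = 2\sqrt{2\log p}$, here is a minimal sketch that solves the program by proximal gradient descent (ISTA). ISTA is a technique swapped in for illustration; it is not the solver discussed in this paper, and any quadratic programming method would do. The function names, problem sizes and amplitudes are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise soft-thresholding, the proximal map of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_objective(X, y, b, sigma, lam):
    return 0.5 * float(np.sum((y - X @ b) ** 2)) + lam * sigma * float(np.sum(np.abs(b)))

def lasso_ista(X, y, sigma, lam, n_iter=5000):
    """Minimize 0.5*||y - X b||^2 + lam*sigma*||b||_1 by proximal gradient."""
    p = X.shape[1]
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - step * grad, step * lam * sigma)
    return b

# Assumed toy instance drawn from the generic S-sparse model.
rng = np.random.default_rng(0)
n, p, S, sigma = 100, 200, 5, 1.0
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)
beta = np.zeros(p)
beta[rng.choice(p, size=S, replace=False)] = rng.choice([-1.0, 1.0], size=S) * 10.0
y = X @ beta + sigma * rng.standard_normal(n)

lam = 2.0 * np.sqrt(2.0 * np.log(p))
b_hat = lasso_ista(X, y, sigma, lam)
err = float(np.sum((X @ beta - X @ b_hat) ** 2))
bound = 8 * (1 + np.sqrt(2)) ** 2 * (2 * np.log(p)) * S * sigma**2
print(f"squared error = {err:.2f}, right-hand side of (1.6) = {bound:.2f}")
```

The printed comparison is only a sanity check on one random draw; it does not verify the hypotheses of Theorem 1.2, whose constant $c_0$ is not specified here.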

Theorem 1.2 asserts that one can estimate $X\beta$ with nearly the same
accuracy as if one knew ahead of time which $\beta_i$'s were nonzero. To see
why this is true, suppose that the support $I$ of the true $\beta$ was known.
In this ideal situation, we would presumably estimate $\beta$ by regressing
$y$ onto the columns of $X$ with indices in $I$, and construct

(1.7)    $\beta^\star = \arg\min_{b \in \mathbb{R}^p} \|y - Xb\|_{\ell_2}^2$ subject to $b_i = 0$ for all $i \notin I$.

It is a simple calculation to show that this ideal estimator (it is ideal,
because we would not know the set of nonzero coordinates) achieves^1

(1.8)    $E\|X\beta - X\beta^\star\|_{\ell_2}^2 = S \cdot \sigma^2$.

Hence, one can see that (1.6) is optimal up to a factor proportional to
$\log p$. It is also known that one cannot, in general, hope for a better
result; the log factor is the price we need to pay for not knowing ahead of
time which of the predictors are actually included in the model.
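The identity (1.8) follows because the ideal fit is the projection of $y$ onto the span of the columns in $I$, so the error is the projected noise and its expected squared norm is $\sigma^2$ times the trace of the projector. The sketch below checks this numerically; the sizes, amplitudes and number of Monte Carlo trials are assumptions chosen for illustration, and the columns in $I$ are taken to be linearly independent.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, S, sigma = 50, 120, 6, 0.5
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)
I = np.sort(rng.choice(p, size=S, replace=False))
beta = np.zeros(p)
beta[I] = rng.choice([-1.0, 1.0], size=S) * 3.0

XI = X[:, I]
P = XI @ np.linalg.pinv(XI)               # orthogonal projector onto span{X_i : i in I}
exact_risk = sigma**2 * float(np.trace(P))  # E||P z||^2 = sigma^2 * trace(P)

def oracle_fit(y):
    """Least-squares fit restricted to the known support I, as in (1.7)."""
    coef, *_ = np.linalg.lstsq(XI, y, rcond=None)
    return XI @ coef

trials = 2000
errs = []
for _ in range(trials):
    y = X @ beta + sigma * rng.standard_normal(n)
    errs.append(float(np.sum((oracle_fit(y) - X @ beta) ** 2)))
mc_risk = float(np.mean(errs))

print(f"S * sigma^2 = {S * sigma**2:.3f}")
print(f"sigma^2 * trace(P) = {exact_risk:.3f}")
print(f"Monte Carlo estimate = {mc_risk:.3f}")
```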

The assumptions of our theorem are pretty mild. Roughly speaking, if 
the predictors are not too collinear, and if S is not too large, then the lasso 
works most of the time. An important point here is that the restriction 
on the sparsity can be very mild. We give two examples to illustrate our 
purpose: 

• Random design. Imagine, as before, that the entries of $X$ are i.i.d.
$N(0, 1)$ and then normalized. Then, the operator norm of $X$ is sharply
concentrated around $\sqrt{p/n}$, so that our assumption essentially reads
$S \le c_0 n/\log p$. Expressed in a different way, $\beta$ does not have to
be sparse at all. It has to be smaller, of course, than the number of
observations, but not by a very large margin.

Similar conclusions would apply to many other types of random matrices.

• Signal estimation. A problem that has attracted quite a bit of attention
in the signal processing community is that of recovering a signal that has
a sparse expansion as a superposition of spikes and sinusoids. Here, we
have noisy data $y$

(1.9)    $y(t) = f(t) + z(t)$,    $t = 1, \ldots, n$,

about a digital signal $f$ of interest, which is expressed as the
"time-frequency" superposition

(1.10)    $f(t) = \sum_{k=1}^{n} \alpha_k^{(0)} \delta(t - k) + \sum_{k=1}^{n} \alpha_k^{(1)} \varphi_k(t)$;

$\delta$ is a Dirac or spike obeying $\delta(t) = 1$ if $t = 0$, and $0$
otherwise. $(\varphi_k(t))_{1 \le k \le n}$ is an orthonormal basis of
sinusoids. The problem (1.9) is of the general form (1.1) with
$X = [I_n \; F_n]$ in which $I_n$ is the identity matrix, $F_n$ is the basis
of sinusoids (a discrete cosine transform) and $\beta$ is the concatenation
of $\alpha^{(0)}$ and $\alpha^{(1)}$. Here, $p = 2n$ and $\|X\| = \sqrt{2}$.
Also, $X$ obeys the coherence property if $n$ or $p$ is not too small, since
$\mu(X) = \sqrt{2/n} = 2/\sqrt{p}$.

Hence, if the signal has a sparse expansion with fewer than on the order
of $n/\log n$ coefficients, then the lasso achieves a quality of
reconstruction that is essentially as good as what could be achieved if we
knew in advance the precise location of the spikes and the exact frequencies
of the sinusoids.

This fact extends to other pairs of orthobases and to general overcomplete
expansions, as we will explain later.

^1 One could also develop a similar estimate with high probability, but we
find it simpler and more elegant to derive the performance in terms of
expectation.
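The spikes-and-sinusoids dictionary can be built explicitly to check the operator norm and the coherence quoted in this example. The sketch below uses an orthonormal DCT-II basis as the basis of sinusoids; the choice of DCT variant, the helper names and the signal length are assumptions, since the text only says "a discrete cosine transform".

```python
import numpy as np

def dct_basis(n):
    """Return an n x n matrix whose columns form an orthonormal DCT-II basis."""
    t = np.arange(n)[:, None]            # time index (rows)
    k = np.arange(n)[None, :]            # frequency index (columns)
    F = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    F[:, 0] = 1.0 / np.sqrt(n)           # constant vector, rescaled to unit norm
    return F

def coherence(X):
    """Largest |<X_i, X_j>| over distinct (unit-normed) columns."""
    G = np.abs(X.T @ X)
    np.fill_diagonal(G, 0.0)
    return float(G.max())

n = 256
F = dct_basis(n)
X = np.hstack([np.eye(n), F])            # X = [I_n  F_n], so p = 2n

op_norm = float(np.linalg.norm(X, 2))
mu = coherence(X)
print(f"||X|| = {op_norm:.6f}, sqrt(2) = {np.sqrt(2):.6f}")
print(f"mu(X) = {mu:.6f}, sqrt(2/n) = {np.sqrt(2 / n):.6f}")
```

With this particular DCT variant the coherence is bounded by $\sqrt{2/n}$ and comes very close to it; other sinusoidal orthobases may attain the value exactly.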

In our two examples, the condition of Theorem 1.2 is satisfied for $S$ as
large as on the order of $n/\log p$; that is, $\beta$ may have a large number
of nonzero components. The novelty here is that the assumptions on the
sparsity level $S$, and on the correlation between predictors, are very
realistic. This is different from the available literature, which typically
requires a much lower bound on the coherence or a much lower sparsity level
(see Section 4 for a comprehensive discussion). In addition, many published
results assume that the entries of the design matrix $X$ are sampled from a
probability distribution (e.g., are i.i.d. samples from the standard normal
distribution), which we are not assuming here (one could of course
specialize our results to random designs as discussed above). Hence, we do
not simply prove that in some idealized setting the lasso would do well, but
that it has a very concrete edge in practical situations, as shown
empirically in a great number of works.

An interesting fact is that one cannot expect (1.6) to hold for all models,
as one can construct simple examples of incoherent matrices and special
$\beta$ for which the lasso does not select a good model (see Section 2). In
this sense, (1.6) can be achieved on the average (or better, in an
overwhelming majority of cases) but not in all cases.

1.3. Exact model recovery. Suppose, now, that we are interested in
estimating the set $I = \{i : \beta_i \neq 0\}$. Then, we show that, if the
values of the nonvanishing $\beta_i$'s are not too small, then the lasso
correctly identifies the "right" model.

Theorem 1.3. Let $I$ be the support of $\beta$, and suppose that

    $\min_{i \in I} |\beta_i| > 8\sigma\sqrt{2\log p}$.

Then, under the assumptions of Theorem 1.2, the lasso estimate with
$\lambda = 2\sqrt{2\log p}$ obeys

(1.11)    $\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta)$ and

(1.12)    $\mathrm{sgn}(\hat\beta_i) = \mathrm{sgn}(\beta_i)$ for all $i \in I$,

with probability at least
$1 - 2p^{-1}((2\pi\log p)^{-1/2} + |I|p^{-1}) - O(p^{-2\log 2})$.

In other words, if the nonzero coefficients are significant in the sense that
they stand above the noise, then the lasso identifies all the variables of
interest and only these. Further, the lasso correctly estimates the signs of
the corresponding coefficients. Again, this does not hold for all $\beta$'s,
as shown in the example of Section 2, but for a wide majority.

Our condition says that the amplitudes must be larger than a constant
times the noise level times $\sqrt{2\log p}$, which is sharp, modulo a small
multiplicative constant. Our statement is nonasymptotic, and relies upon [29]
and [6]. In particular, [29] requires $X$ and $\beta$ to satisfy the
Irrepresentable Condition, which is sufficient to guarantee the exact
recovery of the support of $\beta$ in some asymptotic regime; Section 3.3
connects with this line of work by showing that the Irrepresentable
Condition holds with high probability under the stated assumptions.

As before, we have decided to state the theorem for a concrete value of
$\lambda$, namely $2\sqrt{2\log p}$, but we could have used any value of the
form $(1 + a)\sqrt{2\log p}$ with $a > 0$. When $a$ decreases, our proof
indicates that one can lower the threshold on the minimum nonzero value of
$\beta$ but that, at the same time, the probability of success is also
lowered. When $a$ increases, the converse applies. Finally, our proof shows
that, by setting $\lambda$ close to $\sqrt{2\log p}$ and imposing slightly
stronger conditions on the coherence and the sparsity $S$, one can
substantially lower the threshold on the minimum nonzero value of $\beta$
and bring it close to $\sigma\sqrt{2\log p}$.

We would also like to remark that, under the hypotheses of Theorem 1.3,
one can somewhat improve the estimate (1.6) by using a two-step procedure
similar to that proposed in [10]:

1. Use the lasso to find $\hat I = \{i : \hat\beta_i \neq 0\}$.

2. Find $\tilde\beta$ by regressing $y$ onto the columns $(X_i)$, $i \in \hat I$.

Since $\hat I = I$ with high probability, we have that

    $\|X\tilde\beta - X\beta\|_{\ell_2}^2 = \|P[I]z\|_{\ell_2}^2$

with high probability, where $P[I]$ is the projection onto the space spanned
by the variables $(X_i)$, $i \in I$. Because $\|P[I]z\|_{\ell_2}^2$ is
concentrated around $|I| \cdot \sigma^2 = S \cdot \sigma^2$, it follows that,
with high probability,

    $\|X\tilde\beta - X\beta\|_{\ell_2}^2 \le C \cdot S \cdot \sigma^2$,

where $C$ is some small numerical constant. In other words, when the values
of the nonzero entries of $\beta$ are sufficiently large, one does not have
to pay the logarithmic factor.
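The two-step procedure above can be sketched as follows. The lasso step is solved here by cyclic coordinate descent, which is an illustrative choice and not the method of [10]; the function names, the problem sizes and the amplitude (taken above the threshold $8\sigma\sqrt{2\log p}$ of Theorem 1.3) are assumptions.

```python
import numpy as np

def lasso_cd(X, y, tau, n_sweeps=100):
    """Cyclic coordinate descent for 0.5*||y - X b||^2 + tau*||b||_1.

    Assumes unit-normed columns, so each coordinate update is a
    soft-thresholding of the correlation with the partial residual.
    """
    p = X.shape[1]
    b = np.zeros(p)
    r = y.copy()                               # residual y - X b
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ r + b[j]           # <X_j, residual without X_j>
            new = np.sign(rho) * max(abs(rho) - tau, 0.0)
            if new != b[j]:
                r -= X[:, j] * (new - b[j])
                b[j] = new
    return b

def two_step(X, y, sigma):
    """Step 1: lasso support estimate. Step 2: least squares on that support."""
    p = X.shape[1]
    lam = 2.0 * np.sqrt(2.0 * np.log(p))
    b_lasso = lasso_cd(X, y, lam * sigma)
    I_hat = np.flatnonzero(b_lasso)
    b_refit = np.zeros(p)
    if I_hat.size:
        coef, *_ = np.linalg.lstsq(X[:, I_hat], y, rcond=None)
        b_refit[I_hat] = coef
    return b_lasso, b_refit, I_hat

rng = np.random.default_rng(0)
n, p, S, sigma = 500, 600, 5, 1.0
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)
I = np.sort(rng.choice(p, size=S, replace=False))
beta = np.zeros(p)
amp = 10.0 * sigma * np.sqrt(2.0 * np.log(p))  # above 8*sigma*sqrt(2 log p)
beta[I] = rng.choice([-1.0, 1.0], size=S) * amp
y = X @ beta + sigma * rng.standard_normal(n)

b_lasso, b_refit, I_hat = two_step(X, y, sigma)
err_lasso = float(np.sum((X @ b_lasso - X @ beta) ** 2))
err_refit = float(np.sum((X @ b_refit - X @ beta) ** 2))
print("true support:", I, "estimated support:", I_hat)
print(f"lasso error = {err_lasso:.2f}, refit error = {err_refit:.2f}, S*sigma^2 = {S * sigma**2:.2f}")
```

The refit removes the shrinkage bias of the lasso on the selected columns, which is where the saving of the logarithmic factor comes from.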

1.4. General model selection. In many applications, $\beta$ is not sparse
or does not have a real meaning, so that it does not make much sense to talk
about the values of this vector. Consider an example to make this precise.
Suppose we have noisy data $y$ (1.9) about an $n$-pixel digital image $f$,
where $z$ is white noise. We wish to remove the noise (i.e., estimate the
mean of the vector $y$). A majority of modern methods express the unknown
signal as a superposition of fixed waveforms $(\varphi_i(t))_{1 \le i \le p}$,

(1.13)    $f(t) = \sum_{i=1}^{p} \beta_i \varphi_i(t)$,

and construct an estimate

    $\hat f(t) = \sum_{i=1}^{p} \hat\beta_i \varphi_i(t)$.

That is, one introduces a model $f = X\beta$, in which the columns of $X$ are
the sampled waveforms $\varphi_i(t)$. It is now extremely popular to consider
overcomplete representations with many more waveforms than samples (i.e.,
$p > n$). The reason for this is that overcomplete systems offer a wider
range of generating elements that may be well suited to represent
contributions from different phenomena; potentially, this wider range allows
more flexibility in signal representation and enhances statistical
estimation.

In this setup, two comments are in order. First, there is no ground truth
associated with each coefficient $\beta_i$; there is no real wavelet or
curvelet coefficient. Second, signals of general interest are never really
exactly sparse; they are only approximately sparse, meaning that they may be
well approximated by sparse expansions. These considerations emphasize the
need to formulate results to cover those situations in which the precise
values of $\beta_i$ are either ill-defined or meaningless.

In general, one can understand model selection as follows. Select a model
(a subset $I$ of the columns of $X$) and construct an estimate of $X\beta$ by
projecting $y$ onto the subspace generated by the variables in the model.
Mathematically, this is formulated as

    $X\hat\beta[I] = P[I]y = P[I]X\beta + P[I]z$,

where $P[I]$ denotes the projection onto the space spanned by the variables
$(X_i)$, $i \in I$. What is the accuracy of $X\hat\beta[I]$? Note that^2

(1.14)    $E\|X\beta - X\hat\beta[I]\|_{\ell_2}^2 = \|(\mathrm{Id} - P[I])X\beta\|_{\ell_2}^2 + |I|\sigma^2$.

This is the classical bias variance decomposition; the first term is the
squared bias one gets by using only a subset of columns of $X$ to approximate
the true vector $X\beta$. The second term is the variance of the estimator
and is proportional to the size of the model $I$.

Hence, one can now define the ideal model achieving the minimum MSE
over all models

(1.15)    $\min_{I \subset \{1, \ldots, p\}} \|(\mathrm{Id} - P[I])X\beta\|_{\ell_2}^2 + |I|\sigma^2$.

See Figure 1 for a visual representation. We will refer to this as the ideal
risk. It is ideal in the sense that one could achieve this performance if we
had available an oracle which, knowing $X\beta$, would select for us the best
model to use (i.e., the best subset of explanatory variables).

^2 It is, again, simpler to state the performance in terms of expectation.

To connect this with our earlier discussion, one sees that, if there is a
representation of $f = X\beta$ in which $\beta$ has $S$ nonzero terms, then
the ideal risk is bounded by the variance term, namely, $S \cdot \sigma^2$
[just pick $I$ to be the support of $\beta$ in (1.15)]. The point we would
like to make is that, whereas we did not search for an optimal bias-variance
trade off in the previous section, we will here. The reason is that, even in
the case where the model is interpretable, the projection estimate on the
model corresponding to the nonzero values of $\beta_i$ may very well be
inaccurate and have a mean-squared error that is far larger than (1.15). In
particular, this typically happens if, out of the $S$ nonzero $\beta_i$'s,
only a small fraction are really significant, while the others are not (e.g.,
in the sense that any individual test of significance would not reject the
hypothesis that they vanish). In this sense, the main result of this section,
Theorem 1.4, generalizes but also strengthens Theorem 1.2.

An important question is, of course, whether one can get close to the ideal
risk (1.15) without the help of an oracle. It is known that solving the
combinatorial optimization problem (1.2) with a value of $\lambda_0$ being a
sufficiently large multiple of $\log p$ would provide an MSE within a
multiplicative factor of order $\log p$ of the ideal risk. That real
estimators with such properties exist is inspiring. Yet, solving (1.2) is
computationally intractable. Our next result shows that, in a wide range of
problems, the lasso also nearly achieves the ideal risk.

Fig. 1. The vector $X\beta_0$ is the projection of $X\beta$ on an ideally
selected subset of covariates. These covariates span a plane of optimal
dimension, which, among all planes spanned by subsets of the same dimension,
is closest to $X\beta$.
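To see how the ideal risk, the minimum over models of squared bias plus $|I|\sigma^2$, can be much smaller than the variance term of the full support, the sketch below evaluates it by brute force on a toy instance where three coefficients are large and three are negligible. The helper name `ideal_risk`, the cap `max_size` on the model size and all numerical values are assumptions made for illustration.

```python
import itertools
import numpy as np

def ideal_risk(X, mean, sigma, max_size):
    """Brute-force minimum of ||(Id - P[I]) mean||^2 + |I| sigma^2 over |I| <= max_size."""
    p = X.shape[1]
    best_risk, best_model = float(mean @ mean), ()   # empty model: pure bias
    for k in range(1, max_size + 1):
        for model in itertools.combinations(range(p), k):
            XI = X[:, list(model)]
            coef, *_ = np.linalg.lstsq(XI, mean, rcond=None)
            bias2 = float(np.sum((mean - XI @ coef) ** 2))
            risk = bias2 + k * sigma**2
            if risk < best_risk:
                best_risk, best_model = risk, model
    return best_risk, best_model

rng = np.random.default_rng(0)
n, p, sigma = 10, 12, 1.0
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)
beta = np.zeros(p)
beta[[0, 4, 9]] = [5.0, -5.0, 5.0]        # significant coefficients
beta[[2, 7, 11]] = [0.05, -0.05, 0.05]    # nonzero but negligible
mean = X @ beta

risk, model = ideal_risk(X, mean, sigma, max_size=6)
print("ideal model:", model)
print(f"ideal risk = {risk:.4f}, full-support variance term = {6 * sigma**2:.4f}")
```

Dropping the negligible coefficients costs a small squared bias but saves $\sigma^2$ of variance per dropped variable, so the ideal risk falls well below $S \cdot \sigma^2$ with $S = 6$.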

We are naturally interested in quantifying the performance one can expect
from the lasso in nearly all cases, and, just as before, we now introduce
a useful statistical description of these cases. Consider the best model
$I_0$ achieving the minimum in (1.15). In case of ties, pick one uniformly at
random. Suppose $I_0$ is of cardinality $S$. Then, we introduce the best
$S$-dimensional subset model, which is defined as follows:

1. The subset $I_0 \subset \{1, \ldots, p\}$ of cardinality $S$ is
distributed uniformly at random;

2. Define $\beta_0$ with support $I_0$ via

(1.16)    $X\beta_0 = P[I_0]X\beta$.

In other words, $\beta_0$ is the vector one would get by regressing the true
mean vector $X\beta$ onto the variables in $I_0$; we call $\beta_0$ the ideal
approximation. Conditional on $I_0$, the signs of the nonzero entries of
$\beta_0$ are independent and equally likely to be $-1$ or $1$.

We make no assumption on the amplitudes. Our intent is just the same as
before. All models are equally likely (there is no bias towards special
variables), and one has no a priori information about the sign of the
coefficients associated with each significant variable.

Theorem 1.4. Suppose that $X$ obeys the coherence property, and assume
that the ideal approximation $\beta_0$ is taken from the best
$S$-dimensional subset model. Suppose that $S \le c_0 p/[\|X\|^2 \log p]$
for some positive numerical constant $c_0$. Then, the lasso estimate (1.3)
computed with $\lambda = 2\sqrt{2\log p}$ obeys

(1.17)    $\|X\hat\beta - X\beta\|_{\ell_2}^2 \le (1 + \sqrt{2}) \min_{I \subset \{1, \ldots, p\}} \left[ \|X\beta - P[I]X\beta\|_{\ell_2}^2 + C_0'(2\log p) \cdot |I| \cdot \sigma^2 \right]$

with probability at least $1 - 6p^{-2\log 2} - p^{-1}(2\pi\log p)^{-1/2}$.
The constant $C_0'$ may be taken as $12 + 10\sqrt{2}$.

In words, the lasso nearly selects the best model in a very large majority
of cases. This also strengthens our earlier result, since the right-hand side
in (1.17) is always less than or equal to $O(\log p)\,S\sigma^2$ whenever
there is an $S$-sparse representation.^3

^3 We have assumed that the mean response $f$ of interest is in the span of
the columns of $X$ (i.e., of the form $X\beta$), which always happens when
$p \ge n$ and $X$ has full column rank, for example. However, if this is not
the case, the error would obey
$\|f - X\hat\beta\|_{\ell_2}^2 = \|Pf - X\hat\beta\|_{\ell_2}^2 + \|(\mathrm{Id} - P)f\|_{\ell_2}^2$,
where $P$ is the projection onto the range of $X$. The first term obeys the
oracle inequality, so that the lasso estimates $Pf$ in a near-optimal
fashion. The second term is simply the size of the unmodelled part of the
mean response.

Theorem 1.4 guarantees excellent performance in a broad range of problems.
That is, if we have a design matrix $X$ whose columns are not too
correlated, then, for most responses $X\beta$, the lasso will find a
statistical model with low mean-squared error; simple extensions would also
claim that the lasso finds a statistical model with very good predictive
power, but we will not consider these here. As an illustrative example, we
can consider predicting the clinical outcomes from different tumors on the
basis of gene expression values for each of the tumors. In typical problems,
one considers hundreds of tumors and tens of thousands of genes. While some
of the gene expressions (the columns of $X$) are correlated, one can always
eliminate redundant predictors (e.g., via clustering techniques). Once the
statistician has designed an $X$ with low coherence, the lasso is guaranteed,
in most cases, to find a subset of genes with near-optimal predictive power.

There is a slightly different formulation of this general result which may
go as follows. Let $S_0$ be the maximum sparsity level
$S_0 = \lfloor c_0 p/[\|X\|^2 \log p] \rfloor$, and, for each $S \le S_0$,
introduce $A_S \subset \{-1, 0, 1\}^p$ as the set of all possible signs of
vectors $\beta \in \mathbb{R}^p$, with $\mathrm{sgn}(\beta_i) = 0$ if
$\beta_i = 0$, such that exactly $S$ signs are nonzero. Then, under the
hypotheses of our theorem, for each $X\beta \in \mathbb{R}^n$,

(1.18)    $\|X\beta - X\hat\beta\|_{\ell_2}^2 \le \min_{S \le S_0} \min_{b :\, \mathrm{sgn}(b) \in A_{0,S}} (1 + \sqrt{2}) \left[ \|X\beta - Xb\|_{\ell_2}^2 + C_0'(2\log p) \cdot S \cdot \sigma^2 \right]$

with probability at least $1 - O(p^{-1})$, where one can still take
$C_0' = 12 + 10\sqrt{2}$ (the probability is with respect to the noise
distribution). Above, $A_{0,S}$ is a very large subset of $A_S$, obeying

(1.19)    $|A_{0,S}|/|A_S| \ge 1 - O(p^{-1})$.

Hence, for most $\beta$, the sub-oracle inequality (1.18) is actually the
true oracle inequality.

For completeness, $A_{0,S}$ is defined as follows. Let $b \in A_S$ be
supported on $I$; $b_I$ is the restriction of the vector $b$ to the index
set $I$, and $X_I$ is the submatrix formed by selecting the columns of $X$
with indices in $I$. Then, we say that $b \in A_{0,S}$ if and only if the
following three conditions hold: (1) the submatrix $X_I^* X_I$ is invertible
and obeys $\|(X_I^* X_I)^{-1}\| \le 2$; (2)
$\|X_{I^c}^* X_I (X_I^* X_I)^{-1} b_I\|_{\ell_\infty} \le 1/4$ (recall that
$b \in \{-1, 0, 1\}^p$ is a sign pattern); (3)
$\max_{i \notin I} \|X_I (X_I^* X_I)^{-1} X_I^* X_i\| \le c_0/\sqrt{\log p}$
for some numerical constant $c_0$. In Section 3, we will analyze these three
conditions in detail and prove that $|A_{0,S}|$ is large. The first condition
is called the invertibility condition, and the second and third conditions
are needed to establish that a certain complementary size condition holds
(see Section 3).
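The three conditions above are easy to evaluate numerically for a given sign pattern. The following is a hedged sketch: condition (3) is read as a bound on the norm of the projection of each column $X_i$, $i \notin I$, onto the span of the columns in $I$, the value of $c_0$ is a placeholder since the text only says "some numerical constant", and the function name is hypothetical. It assumes the pattern has fewer than $p$ nonzero entries.

```python
import numpy as np

def sign_pattern_conditions(X, b, c0=1.0):
    """Check the three conditions on a sign pattern b in {-1, 0, 1}^p.

    c0 is an assumed placeholder value, not a constant taken from the paper.
    """
    p = X.shape[1]
    b = np.asarray(b, dtype=float)
    I = np.flatnonzero(b)
    Ic = np.setdiff1d(np.arange(p), I)
    XI = X[:, I]
    gram = XI.T @ XI
    if np.linalg.matrix_rank(gram) < I.size:
        return {"invertibility": False, "sign_correlation": False, "projection": False}
    gram_inv = np.linalg.inv(gram)
    # (1) invertibility: ||(X_I^* X_I)^{-1}|| <= 2
    cond1 = np.linalg.norm(gram_inv, 2) <= 2.0
    # (2) ||X_{I^c}^* X_I (X_I^* X_I)^{-1} b_I||_inf <= 1/4
    cond2 = np.max(np.abs(X[:, Ic].T @ XI @ gram_inv @ b[I])) <= 0.25
    # (3) max over i not in I of ||X_I (X_I^* X_I)^{-1} X_I^* X_i|| <= c0 / sqrt(log p)
    proj = XI @ gram_inv @ XI.T @ X[:, Ic]
    cond3 = np.max(np.linalg.norm(proj, axis=0)) <= c0 / np.sqrt(np.log(p))
    return {"invertibility": bool(cond1), "sign_correlation": bool(cond2), "projection": bool(cond3)}

# Assumed toy instance: Gaussian design and a random sign pattern with S = 5.
rng = np.random.default_rng(0)
n, p, S = 2000, 3000, 5
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)
b = np.zeros(p)
b[rng.choice(p, size=S, replace=False)] = rng.choice([-1.0, 1.0], size=S)
print(sign_pattern_conditions(X, b))
```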



1.5. Implications for signal estimation. Our findings may be of interest
to researchers interested in signal estimation, and we now recast our main
results in the language of signal processing. Suppose we are interested in
estimating a signal $f(t)$ from observations

    $y(t) = f(t) + z(t)$,    $t = 0, \ldots, n - 1$,

where $z$ is white noise with variance $\sigma^2$. We are given a dictionary
of waveforms $(\varphi_i(t))_{1 \le i \le p}$, which are normalized so that
$\sum_{t=0}^{n-1} \varphi_i^2(t) = 1$, and are looking for an estimate of the
form $\hat f(t) = \sum_{i=1}^{p} \hat\alpha_i \varphi_i(t)$. When we have an
overcomplete representation in which $p > n$, there are infinitely many ways
of representing $f$ as a superposition of the dictionary elements.

Now, introduce the best $m$-term approximation $f_m$, which is defined via

    $\|f - f_m\|_{\ell_2} = \inf_{a :\, \#\{i :\, a_i \neq 0\} \le m} \left\| f - \sum_i a_i \varphi_i \right\|_{\ell_2}$;

that is, it is that linear combination of at most $m$ elements of the
dictionary that comes closest to the object $f$ of interest.^4 With these
notations, if we could somehow guess the best model of dimension $m$, one
would achieve an MSE equal to

    $\|f - f_m\|_{\ell_2}^2 + m\sigma^2$.

^4 Note that, again, finding $f_m$ is generally a combinatorially hard
problem.

Therefore, one can rewrite the ideal risk (which could be attained with the
help of an oracle telling us exactly which subset of waveforms to use) as

(1.20)    $\min_{0 \le m \le p} \|f - f_m\|_{\ell_2}^2 + m\sigma^2$,

which is exactly the trade-off between the approximation error and the
number of terms in the partial expansion.^5
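In the special case where the dictionary is an orthonormal basis (an assumption made only for this example; the overcomplete case is combinatorially hard), the best $m$-term approximation simply keeps the $m$ largest coefficients, and the trade-off (1.20) can be evaluated directly. The function name and the coefficient values below are illustrative.

```python
import numpy as np

def ideal_risk_orthobasis(coeffs, sigma):
    """Evaluate min over m of ||f - f_m||^2 + m*sigma^2 in an orthonormal basis.

    With orthonormal waveforms, ||f - f_m||^2 is the sum of the squared
    coefficients that are NOT among the m largest in magnitude.
    """
    c2 = np.sort(np.asarray(coeffs, dtype=float) ** 2)[::-1]     # decreasing
    # tail[m] = sum of c2[m:], i.e. the approximation error of f_m
    tail = np.concatenate([np.cumsum(c2[::-1])[::-1], [0.0]])
    risks = tail + sigma**2 * np.arange(c2.size + 1)
    m_star = int(np.argmin(risks))
    return float(risks[m_star]), m_star

coeffs = [5.0, 3.0, 0.5, 0.2, 0.1]
sigma = 1.0
risk, m_star = ideal_risk_orthobasis(coeffs, sigma)
print(f"ideal risk = {risk:.2f} attained with m = {m_star} terms")
```

In this orthonormal setting the minimum equals $\sum_i \min(c_i^2, \sigma^2)$: a coefficient is worth keeping exactly when it stands above the noise level.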

Consider, now, the estimate $\hat f = \sum_i \hat\alpha_i \varphi_i$, where
$\hat\alpha$ is the solution to

(1.21)    $\min_{a \in \mathbb{R}^p} \frac{1}{2} \left\| y - \sum_i a_i \varphi_i \right\|_{\ell_2}^2 + \lambda\sigma\|a\|_{\ell_1}$,

with $\lambda = 2\sqrt{2\log p}$, say. Then, provided that the dictionary is
not too redundant in the sense that
$\max_{1 \le i < j \le p} |\langle \varphi_i, \varphi_j \rangle| \le c_0/\log p$,
Theorem 1.4 asserts that, for most signals $f$, the minimum-$\ell_1$
estimator (1.21) obeys

(1.22)    $\|f - \hat f\|_{\ell_2}^2 \le C_0 \inf_m \left[ \|f - f_m\|_{\ell_2}^2 + \log p \cdot m\sigma^2 \right]$

with large probability and for some reasonably small numerical constant
$C_0$. In other words, one obtains a squared error that is within a
logarithmic factor of what can be achieved with information provided by a
genie.

Overcomplete representations are now in widespread use, as in the field
of artificial neural networks, for instance [12]. In computational harmonic
analysis and image/signal processing, there is an emerging wisdom, which
says that: (1) there is no universal representation for signals of interest,
and (2) different representations are best for different phenomena ("best"
is here understood as providing sparser representations). For instance:

• sinusoids are best for oscillatory phenomena;

• wavelets [18] are best for point-like singularities;

• curvelets [7, 8] are best for curve-like singularities (edges);

• local cosines are best for textures; and so on.

Thus, many efficient methods in modern signal estimation proceed by forming
an overcomplete dictionary, a union of several distinct representations,
and then extracting a sparse superposition that fits the data well. The main
result of this paper says that, if one solves the quadratic program (1.21),
then one is provably guaranteed near-optimal performance for most signals
of interest, which is why these results might be of interest to people
working in this field.

^5 It is also known that, for many interesting classes of signals
$\mathcal{F}$ and appropriately chosen dictionaries, taking the supremum over
$f \in \mathcal{F}$ in (1.20) comes within a log factor of the minimax risk
for $\mathcal{F}$.

The spikes and sines model has been studied extensively in the literature
on information theory in the nineties, and, there, the assumption that the
"arrival times" of the spikes and the frequencies of the sinusoids are
random is legitimate. In other situations, the model may be less adequate.
For instance, in image processing, the large wavelet coefficients tend to
appear early in the series, that is, at low frequencies. With this in mind,
two comments are in order. First, it is likely that similar results would
hold for other models (we just considered the simplest). And second, if we
have a lot of a priori information about which coefficients are more likely
to be significant, then we would probably not want to use the plain lasso
(1.3) but rather incorporate this side information.

1.6. Organization of the paper. The paper is organized as follows. In
Section 2, we explain why our results are nearly optimal and cannot be
fundamentally improved. Section 3 introduces a recent result due to Joel
Tropp regarding the norm of certain random submatrices, which is essential
to our proofs, and then proves all of our results. We conclude with a
discussion in Section 4, where, for the most part, we relate our work with a
series of other published results and distinguish our main contributions.

2. Optimality. 

2.1. For almost all sparse models. A natural question is whether one
can relax the condition about $\beta$ being generically sparse, or about
$X\beta$ being well approximated by a generically sparse superposition of
covariates. The emphasis is on "generic," meaning that our results apply to
nearly all objects taken from a statistical ensemble but perhaps not all.
This raises a question: can one hope to establish versions of our results
that would hold universally? The answer is negative. Even in the case when
$X$ has very low coherence, one can show that the lasso does not provide an
accurate estimation of certain mean vectors $X\beta$ with a sparse
coefficient sequence. This section gives one such example.

Suppose, as in Section 1.2, that we wish to estimate a signal assumed to be a sparse superposition of spikes and sinusoids. We assume that the length $n$ of the signal $f(t)$, $t = 0, 1, \ldots, n-1$, is equal to $n = 2^{2j}$ for some integer $j$. The basis of spikes is as before, and the orthobasis of sinusoids takes the form

$$\varphi_1(t) = 1/\sqrt{n},$$
$$\varphi_{2k}(t) = \sqrt{2/n}\,\cos(2\pi k t/n), \qquad k = 1, 2, \ldots, n/2-1,$$
$$\varphi_{2k+1}(t) = \sqrt{2/n}\,\sin(2\pi k t/n), \qquad k = 1, 2, \ldots, n/2-1,$$
$$\varphi_n(t) = (-1)^t/\sqrt{n}.$$

Recall the discrete identity (a discrete analog of the Poisson summation formula)

$$\sum_{k=0}^{2^j-1} \delta(t - k2^j) = \frac{1}{2^j}\sum_{k=0}^{2^j-1} e^{i 2\pi k 2^j t/n} = \frac{1}{2^j}\bigl(1 + (-1)^t\bigr) + \frac{2}{2^j}\sum_{k=1}^{2^{j-1}-1} \cos(2\pi k 2^j t/n). \tag{2.1}$$
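The identity (2.1) is easy to check numerically. The following is a minimal NumPy sketch, not part of the original argument; the function name `comb_identity_gap` is ours, and the code only assumes the conventions $n = 2^{2j}$ and $t = 0, \ldots, n-1$ used above.

```python
import numpy as np

def comb_identity_gap(j):
    """Largest absolute gap between the two sides of (2.1) for n = 2**(2*j)."""
    n = 2 ** (2 * j)
    m = 2 ** j                                  # sqrt(n): spike spacing and spike count
    t = np.arange(n)
    comb = (t % m == 0).astype(float)           # sum_k delta(t - k 2^j)
    k = np.arange(1, m // 2)                    # k = 1, ..., 2^(j-1) - 1
    cosines = np.cos(2 * np.pi * np.outer(k, t) * m / n).sum(axis=0)
    rhs = ((1 + (-1.0) ** t) + 2 * cosines) / m
    return np.max(np.abs(comb - rhs))

# Gaps for a few signal lengths n = 4, 16, 64, 256.
gaps = [comb_identity_gap(j) for j in (1, 2, 3, 4)]
```

For $j = 4$ this is the case $n = 256$ with 16 spikes, the setting of the numerical example discussed later in this section.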

Then, consider the model

$$y = \mathbf{1} + z = X\beta + z,$$

where $\mathbf{1}$ is the constant signal equal to 1, and $X$ is the $n \times (2n-1)$ matrix

$$X = [I_n \;\; F_{n,2:n}],$$

in which $I_n$ is the identity (the basis of spikes) and $F_{n,2:n}$ is the orthobasis of sinusoids minus the first basis vector $\varphi_1$. Note that this is a low-coherence matrix $X$, since $\mu(X) = \sqrt{2/n}$. In plain English, we are simply trying to estimate a constant-mean vector. It follows, from (2.1), that

$$\mathbf{1} = 2^j \sum_{k=0}^{2^j-1} \delta(t - k2^j) - (-1)^t - 2\sum_{k=1}^{2^{j-1}-1} \cos(2\pi k 2^j t/n),$$

so that $\mathbf{1}$ has a sparse expansion, since it is a superposition of at most $\sqrt{n}$ spikes and $\sqrt{n}/2$ sinusoids (it can also be deduced from existing results that this is actually the sparsest expansion). In other words, if we knew which column vectors to use, we could obtain



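$$\mathbb{E}\|X\beta^\star - X\beta\|_{\ell_2}^2 = \tfrac{3}{2}\sqrt{n}\,\sigma^2.$$

How does the lasso compare? We claim that, with very high probability,

$$\hat\beta_i = \begin{cases} y_i - \lambda\sigma, & i \in \{1, \ldots, n\}, \\ 0, & i \in \{n+1, \ldots, 2n-1\}, \end{cases} \tag{2.2}$$

so that

$$X\hat\beta = y - \lambda\sigma\mathbf{1}, \tag{2.3}$$

provided that $\lambda\sigma < 1/2$. In short, the lasso does not find the sparsest model at all. As a matter of fact, it finds a model as dense as it can be, and the resulting mean-squared error is awful, since

$$\mathbb{E}\|X\hat\beta - X\beta\|_{\ell_2}^2 \approx (1 + \lambda^2)\,n\sigma^2.$$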




Fig. 2. Sparse signal recovery with the lasso. (a) Values of the estimated coefficients. All the spike coefficients are obtained by soft-thresholding $y$ and are nonzero. (b) Lasso signal estimate; $X\hat\beta$ is just a shifted version of the noisy signal.

Even if one could somehow remove the bias, this would still be a very bad 
performance. 

An illustrative numerical example is displayed in Figure 2. In this example, $n = 256$ so that $p = 512 - 1 = 511$. The mean vector $X\beta$ is made up as above, and there is a representation in which $\beta$ has only 24 nonzero coefficients. Yet, the lasso finds a model of dimension 256 (i.e., it selects as many variables as there are observations).

We need to justify (2.2), as (2.3) would be an immediate consequence. It follows, from taking the subgradient of the lasso functional, that $\hat\beta$ is a minimizer if and only if

$$\begin{aligned} X_i^*(y - X\hat\beta) &= \lambda\sigma\,\mathrm{sgn}(\hat\beta_i), & \hat\beta_i &\neq 0, \\ |X_i^*(y - X\hat\beta)| &\le \lambda\sigma, & \hat\beta_i &= 0. \end{aligned} \tag{2.4}$$

One can further establish that $\hat\beta$ is the unique minimizer of (1.3) if

$$\begin{aligned} X_i^*(y - X\hat\beta) &= \lambda\sigma\,\mathrm{sgn}(\hat\beta_i), & \hat\beta_i &\neq 0, \\ |X_i^*(y - X\hat\beta)| &< \lambda\sigma, & \hat\beta_i &= 0, \end{aligned} \tag{2.5}$$

and the columns indexed by the support of $\hat\beta$ are linearly independent (note the strict inequalities). We then simply need to show that $\hat\beta$, given by (2.2), obeys (2.5). Suppose that $\min_i y_i > \lambda\sigma$. A sufficient condition is that $\max_i |z_i| < 1 - \lambda\sigma$, which occurs with very large probability if $\lambda\sigma < 1/2$ and $\lambda > \sqrt{2\log n}$ [see (3.4) with $X = I$]. (One can always allow for larger noise by multiplying the signal by a factor greater than 1.) Note that $y - X\hat\beta = \lambda\sigma\mathbf{1}$, so that, for $i \in \{1, \ldots, n\}$, we have

$$X_i^*(y - X\hat\beta) = \lambda\sigma = \lambda\sigma\,\mathrm{sgn}(\hat\beta_i),$$

whereas, for $i \in \{n+1, \ldots, 2n-1\}$, we have

$$X_i^*(y - X\hat\beta) = \lambda\sigma\langle X_i, \mathbf{1}\rangle = 0,$$

which proves our claim.

To summarize, even when the coherence is low (i.e., of size about $1/\sqrt{n}$), there are sparse vectors $\beta$ with sparsity level about equal to $\sqrt{n}$, for which the lasso completely misbehaves (we presented an example but there are of course many others). Therefore, it is a fact that none of our theorems, namely, Theorems 1.2-1.4, can hold for all $\beta$'s. In this sense, they are sharp.
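The counterexample of this section can be reproduced numerically. The sketch below is only an illustration under stated assumptions: it builds the spikes-and-sinusoids matrix with the standard real Fourier orthobasis, and the noise level `sigma = 0.05` and the factor `1.1` in front of $\sqrt{2\log n}$ are our own choices, picked so that $\lambda\sigma < 1/2$ and $\min_i y_i > \lambda\sigma$ hold comfortably. It then checks that the dense vector in (2.2) satisfies the optimality conditions (2.5).

```python
import numpy as np

def sinusoid_basis(n):
    """Real Fourier orthobasis of R^n (n even), columns phi_1, ..., phi_n."""
    t = np.arange(n)
    cols = [np.ones(n) / np.sqrt(n)]
    for k in range(1, n // 2):
        cols.append(np.sqrt(2 / n) * np.cos(2 * np.pi * k * t / n))
        cols.append(np.sqrt(2 / n) * np.sin(2 * np.pi * k * t / n))
    cols.append((-1.0) ** t / np.sqrt(n))
    return np.column_stack(cols)

n, sigma = 256, 0.05                      # sigma is an illustrative choice
rng = np.random.default_rng(0)
X = np.hstack([np.eye(n), sinusoid_basis(n)[:, 1:]])   # n x (2n - 1), phi_1 dropped
lam = 1.1 * np.sqrt(2 * np.log(n))        # lambda > sqrt(2 log n), lambda*sigma < 1/2
thr = lam * sigma
y = 1.0 + sigma * rng.standard_normal(n)  # constant signal plus noise

# Candidate (2.2): soft-thresholded spikes, no sinusoids.
beta_hat = np.concatenate([y - thr, np.zeros(n - 1)])
corr = X.T @ (y - X @ beta_hat)           # correlations appearing in (2.5)

mu = np.max(np.abs(X.T @ X - np.eye(2 * n - 1)))        # coherence of X
```

On the spike coordinates the correlations equal $\lambda\sigma$, on the sinusoid coordinates they vanish up to rounding, and all $n$ spike coefficients are nonzero, so the selected model has dimension $n$ even though a 24-term representation exists.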



2.2. For sufficiently incoherent matrices. We now show that predictors cannot be too collinear, and we begin by examining a small problem in which $X$ is a $2 \times 2$ matrix $X = [X_1, X_2]$. We violate the coherence property by choosing $X_1$ and $X_2$, so that $\langle X_1, X_2\rangle = 1 - \varepsilon$, where we think of $\varepsilon$ as being very small. Assume, without loss of generality, that $\sigma = 1$ to simplify. Now, consider

$$\beta = \frac{a}{\varepsilon}\begin{bmatrix} 1 \\ -1 \end{bmatrix},$$

where $a$ is some positive amplitude, and observe that $X\beta = a\varepsilon^{-1}(X_1 - X_2)$ and $X^*X\beta = a(1, -1)^*$. It is well known that the lasso estimate $\hat\beta$ vanishes if $\|X^*y\|_{\ell_\infty} \le \lambda$. Now,

$$\|X^*y\|_{\ell_\infty} \le a + \|X^*z\|_{\ell_\infty},$$

so that, if $a = 1$, say, and $\lambda$ is not ridiculously small, then there is a positive probability $\pi_0$ that $\hat\beta = 0$, where $\pi_0 \ge \mathbb{P}(\|X^*z\|_{\ell_\infty} \le \lambda - 1)$.$^7$ For example, if $\lambda > 1 + 3 = 4$, then $\hat\beta = 0$, as long as both entries of $X^*z$ are within 3 standard deviations of 0. When $\hat\beta = 0$, the squared error loss obeys

$$\|X\beta\|_{\ell_2}^2 = \frac{2a^2}{\varepsilon},$$

which can be made arbitrarily large if we allow $\varepsilon$ to be arbitrarily small.
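The arithmetic of this 2-by-2 example can be verified directly. The following sketch is only an illustration: the particular pair of unit vectors with inner product $1 - \varepsilon$ and the helper names are our own choices, and the last function merely tests the sufficient condition $\|X^*y\|_{\ell_\infty} \le \lambda$ quoted above.

```python
import numpy as np

def coherent_pair(eps, a=1.0):
    """Unit-normed columns with <X1, X2> = 1 - eps, and beta = (a/eps) * (1, -1)."""
    c = 1.0 - eps
    X = np.array([[1.0, c],
                  [0.0, np.sqrt(1.0 - c * c)]])
    beta = (a / eps) * np.array([1.0, -1.0])
    return X, beta

def lasso_estimate_vanishes(X, y, lam):
    """Sufficient condition for the lasso estimate to be zero."""
    return np.max(np.abs(X.T @ y)) <= lam

eps = 1e-3
X, beta = coherent_pair(eps)
gram_times_beta = X.T @ X @ beta          # should equal a * (1, -1)
loss_if_zero = np.sum((X @ beta) ** 2)    # ||X beta||^2, should equal 2 a^2 / eps
vanishes_noiseless = lasso_estimate_vanishes(X, X @ beta, lam=4.0)
```

With $\varepsilon = 10^{-3}$ the loss incurred by the zero estimate is about 2000, even though $\|X^*X\beta\|_{\ell_\infty}$ is only 1.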

Of course, the culprit in our 2-by-2 example is hardly sparse, and we now consider the $n \times n$ block-diagonal matrix $X_0$ ($n$ even)

$$X_0 = \begin{bmatrix} X & & & \\ & X & & \\ & & \ddots & \\ & & & X \end{bmatrix}$$

with blocks made out of $n/2$ copies of $X$. We now consider $\beta$ from the $S$-sparse model with independent entries sampled from the distribution (we choose $a = 1$ for simplicity, but we could consider other values as well)

$$\beta_i = \begin{cases} \varepsilon^{-1}, & \text{w.p. } n^{-1/2}, \\ -\varepsilon^{-1}, & \text{w.p. } n^{-1/2}, \\ 0, & \text{w.p. } 1 - 2n^{-1/2}. \end{cases}$$

Certainly, the support of $\beta$ is random, and the signs are random. One could argue that the size of the support is not fixed (the expected value is $2\sqrt{n}$, so that $\beta$ is sparse with very large probability), but this is obviously unessential.$^8$

$^7$$\pi_0$ can be calculated since $X^*z$ is a bivariate Gaussian variable.

$^8$We could alternatively select the support at random and randomly assign the signs, and this would not change our story in the least.

Because $X_0$ is block diagonal, the lasso functional becomes additive, and the lasso will minimize each individual term of the form $\frac12\|Xb^{(i)} - y^{(i)}\|_{\ell_2}^2 + \lambda\|b^{(i)}\|_{\ell_1}$, where $b^{(i)} = (b_{2i-1}, b_{2i})$ and $y^{(i)} = (y_{2i-1}, y_{2i})$. If, for any of these subproblems, $\beta^{(i)} = \pm\varepsilon^{-1}(1, -1)$ as in our 2-by-2 example above, then the squared error will blow up (as $\varepsilon$ gets smaller) with probability $\pi_0$. With $i$ fixed, $\mathbb{P}(\beta^{(i)} = \pm\varepsilon^{-1}(1, -1)) = 2/n$ and, thus, the probability that none of these subproblems is poised to blow up is $(1 - \frac{2}{n})^{n/2} \to \frac{1}{e}$ as $n \to \infty$. Formalizing matters, we have a squared loss of at least $2/\varepsilon$ with probability at least $\pi_0(1 - (1 - \frac{2}{n})^{n/2})$. Note that, when $n$ is large, $\lambda$ is large, so that $\pi_0$ is close to 1, and the lasso badly misbehaves with a probability greater than or equal to a quantity approaching $1 - 1/e$.

In conclusion, the lasso may perform badly, even with a random $\beta$, when all our assumptions are met but the coherence property. To summarize, an upper bound on the coherence is also necessary.

3. Proofs. In this section, we prove all of our results. It is sufficient to establish our theorems with $\sigma = 1$, as the general case is treated by a simple rescaling. Therefore, we conveniently assume that $\sigma = 1$ from now on. Here, and in the remainder of this paper, $x_I$ is the restriction of the vector $x$ to an index set $I$, and, for a matrix $X$, $X_I$ is the submatrix formed by selecting the columns of $X$ with indices in $I$. In the following, it will also be convenient to denote, by $K$, the functional

$$K(y, b) = \tfrac12\|y - Xb\|_{\ell_2}^2 + 2\lambda_p\|b\|_{\ell_1}, \tag{3.1}$$

in which $\lambda_p = \sqrt{2\log p}$.

3.1. Preliminaries. We will make frequent use of subgradients, and we begin by briefly recalling what these are. We say that $u \in \mathbb{R}^p$ is a subgradient of a convex function $f : \mathbb{R}^p \to \mathbb{R}$ at $x_0$ if $f$ obeys

$$f(x) \ge f(x_0) + \langle u, x - x_0\rangle \tag{3.2}$$

for all $x$.

Further, our arguments will repeatedly use two general results that we now record. The first states that the lasso estimate is feasible for the Dantzig selector optimization problem.

Lemma 3.1. The lasso estimate obeys

$$\|X^*(y - X\hat\beta)\|_{\ell_\infty} \le 2\lambda_p. \tag{3.3}$$

Proof. Since $\hat\beta$ minimizes $f(b) = K(y, b)$ over $b$, 0 must be a subgradient of $f$ at $\hat\beta$. Now, the subgradients of $f$ at $b$ are of the form

$$X^*(Xb - y) + 2\lambda_p\epsilon,$$

where $\epsilon$ is any $p$-dimensional vector obeying $\epsilon_i = \mathrm{sgn}(b_i)$ if $b_i \neq 0$ and $|\epsilon_i| \le 1$ otherwise. Hence, since 0 is a subgradient at $\hat\beta$, there exists $\epsilon$ as above such that

$$X^*(X\hat\beta - y) = -2\lambda_p\epsilon.$$

The conclusion follows from $\|\epsilon\|_{\ell_\infty} \le 1$. □
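Lemma 3.1 can be observed on a computed lasso solution. The sketch below is a rough illustration, not the solver used for the paper's experiments: it minimizes the functional $K(y, b)$ of (3.1) by plain cyclic coordinate descent (a standard method we substitute here), and the problem sizes, the random seed and the signal amplitude are arbitrary assumptions.

```python
import numpy as np

def soft(x, t):
    """Scalar soft-thresholding operator."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def lasso_cd(X, y, tau, n_sweeps=500):
    """Cyclic coordinate descent for 0.5 * ||y - X b||^2 + tau * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()                          # running residual y - X b
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * b[j]           # remove coordinate j from the fit
            b[j] = soft(X[:, j] @ r, tau) / col_sq[j]
            r -= X[:, j] * b[j]           # put the updated coordinate back
    return b

rng = np.random.default_rng(0)
n, p = 100, 150
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)            # unit-normed columns
beta = np.zeros(p)
beta[:5] = 20.0                           # illustrative sparse signal
y = X @ beta + rng.standard_normal(n)     # sigma = 1
lam_p = np.sqrt(2.0 * np.log(p))

b_hat = lasso_cd(X, y, 2.0 * lam_p)
kkt = np.max(np.abs(X.T @ (y - X @ b_hat)))   # left-hand side of (3.3)
```

Up to the convergence tolerance of the iteration, `kkt` does not exceed $2\lambda_p$, and it equals $2\lambda_p$ exactly on the coordinates where `b_hat` is nonzero.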

The second general result states that $\|X^*z\|_{\ell_\infty}$ cannot be too large. With large probability, $z \sim \mathcal{N}(0, I)$ obeys

$$\|X^*z\|_{\ell_\infty} = \max_i |\langle X_i, z\rangle| \le \lambda_p. \tag{3.4}$$

This is standard and simply follows from the fact that $\langle X_i, z\rangle \sim \mathcal{N}(0, 1)$. Hence, for each $t > 0$,

$$\mathbb{P}(\|X^*z\|_{\ell_\infty} > t) \le 2p \cdot \phi(t)/t, \tag{3.5}$$

where $\phi(t) = (2\pi)^{-1/2}e^{-t^2/2}$. Better bounds may be possible, but we will not pursue these refinements here. Also, note that $\|X^*z\|_{\ell_\infty} \le \sqrt{2}\lambda_p$ with probability at least $1 - p^{-1}(2\pi\log p)^{-1/2}$. These two general facts have an interesting consequence, since it follows from the decomposition $y = X\beta + z$ and the triangle inequality that, with high probability,

$$\begin{aligned} \|X^*X(\beta - \hat\beta)\|_{\ell_\infty} &\le \|X^*(X\beta - y)\|_{\ell_\infty} + \|X^*(y - X\hat\beta)\|_{\ell_\infty} \\ &= \|X^*z\|_{\ell_\infty} + \|X^*(y - X\hat\beta)\|_{\ell_\infty} \\ &\le (\sqrt{2} + 2)\lambda_p. \end{aligned} \tag{3.6}$$

3.2. Proof of Theorem 1.2. Put $I$ for the support of $\beta$. To prove our claim, we first establish that (1.6) holds provided that the following three deterministic conditions are satisfied:

• Invertibility condition. The submatrix $X_I^*X_I$ is invertible and obeys

$$\|(X_I^*X_I)^{-1}\| \le 2. \tag{3.7}$$

The number 2 is arbitrary; we just need the smallest eigenvalue of $X_I^*X_I$ to be bounded away from zero.

• Orthogonality condition. The vector $z$ obeys $\|X^*z\|_{\ell_\infty} \le \sqrt{2}\lambda_p$.

• Complementary size condition. The following inequality holds:

$$\|X_{I^c}^*X_I(X_I^*X_I)^{-1}X_I^*z\|_{\ell_\infty} + 2\lambda_p\|X_{I^c}^*X_I(X_I^*X_I)^{-1}\mathrm{sgn}(\beta_I)\|_{\ell_\infty} \le (2 - \sqrt{2})\lambda_p. \tag{3.8}$$

This section establishes the main estimate (1.6), assuming these three conditions hold, whereas the next will show that all three conditions hold with large probability, hence proving Theorem 1.2. Note that, when $z$ is white noise, we already know that the orthogonality condition holds with probability at least $1 - p^{-1}(2\pi\log p)^{-1/2}$.
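The probability quoted for the orthogonality condition is just the right-hand side of (3.5) evaluated at $t = \sqrt{2}\lambda_p = 2\sqrt{\log p}$. A short sketch of that arithmetic follows; the function names are ours and nothing beyond (3.5) is assumed.

```python
import math

def gaussian_max_tail_bound(p, t):
    """Right-hand side of (3.5): 2 p phi(t) / t, with phi the standard normal density."""
    phi = math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    return 2.0 * p * phi / t

def orthogonality_failure_bound(p):
    """Bound on P(||X^* z||_inf > sqrt(2) * lambda_p), with lambda_p = sqrt(2 log p)."""
    lam_p = math.sqrt(2.0 * math.log(p))
    return gaussian_max_tail_bound(p, math.sqrt(2.0) * lam_p)

# Compare with the closed form p^{-1} (2 pi log p)^{-1/2} for a few values of p.
checks = [(p, orthogonality_failure_bound(p),
           1.0 / (p * math.sqrt(2.0 * math.pi * math.log(p))))
          for p in (10, 1000, 10 ** 6)]
```

The two columns of `checks` agree, which is the computation $2p\,\phi(2\sqrt{\log p})/(2\sqrt{\log p}) = p^{-1}(2\pi\log p)^{-1/2}$.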

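Assume, then, that all three conditions above hold. Since $\hat\beta$ minimizes $K(y, b)$, we have $K(y, \hat\beta) \le K(y, \beta)$ or, equivalently,

$$\tfrac12\|y - X\hat\beta\|_{\ell_2}^2 + 2\lambda_p\|\hat\beta\|_{\ell_1} \le \tfrac12\|y - X\beta\|_{\ell_2}^2 + 2\lambda_p\|\beta\|_{\ell_1}.$$

Set $h = \hat\beta - \beta$, and note that

$$\|y - X\hat\beta\|_{\ell_2}^2 = \|(y - X\beta) - Xh\|_{\ell_2}^2 = \|Xh\|_{\ell_2}^2 + \|y - X\beta\|_{\ell_2}^2 - 2\langle Xh, y - X\beta\rangle.$$

Plugging this identity with $z = y - X\beta$ into the above inequality and rearranging the terms gives

$$\tfrac12\|Xh\|_{\ell_2}^2 \le \langle Xh, z\rangle + 2\lambda_p(\|\beta\|_{\ell_1} - \|\hat\beta\|_{\ell_1}). \tag{3.9}$$

Next, break $h$ up into $h_I$ and $h_{I^c}$ (observe that $\hat\beta_{I^c} = h_{I^c}$) and rewrite (3.9) as

$$\tfrac12\|Xh\|_{\ell_2}^2 \le \langle h, X^*z\rangle + 2\lambda_p(\|\beta_I\|_{\ell_1} - \|\beta_I + h_I\|_{\ell_1} - \|h_{I^c}\|_{\ell_1}).$$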

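For each $i \in I$, we have

$$|\hat\beta_i| = |\beta_i + h_i| \ge |\beta_i| + \mathrm{sgn}(\beta_i)\,h_i$$

and, thus, $\|\beta_I + h_I\|_{\ell_1} \ge \|\beta_I\|_{\ell_1} + \langle h_I, \mathrm{sgn}(\beta_I)\rangle$. Inserting this inequality above yields

$$\tfrac12\|Xh\|_{\ell_2}^2 \le \langle h, X^*z\rangle - 2\lambda_p(\langle h_I, \mathrm{sgn}(\beta_I)\rangle + \|h_{I^c}\|_{\ell_1}). \tag{3.10}$$

Observe, now, that $\langle h, X^*z\rangle = \langle h_I, X_I^*z\rangle + \langle h_{I^c}, X_{I^c}^*z\rangle$ and that the orthogonality condition implies

$$\langle h_{I^c}, X_{I^c}^*z\rangle \le \|h_{I^c}\|_{\ell_1}\|X_{I^c}^*z\|_{\ell_\infty} \le \sqrt{2}\lambda_p\|h_{I^c}\|_{\ell_1}.$$

The conclusion is the useful estimate

$$\tfrac12\|Xh\|_{\ell_2}^2 \le \langle h_I, v\rangle - (2 - \sqrt{2})\lambda_p\|h_{I^c}\|_{\ell_1}, \tag{3.11}$$

where $v = X_I^*z - 2\lambda_p\,\mathrm{sgn}(\beta_I)$.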

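We complete the argument by bounding $\langle h_I, v\rangle$. The key here is to use the fact that $\|X^*Xh\|_{\ell_\infty}$ is known to be small, as pointed out by Terence Tao [25]. We have

$$\begin{aligned} \langle h_I, v\rangle &= \langle (X_I^*X_I)^{-1}X_I^*X_I h_I, v\rangle \\ &= \langle X_I^*X_I h_I, (X_I^*X_I)^{-1}v\rangle \\ &= \langle X_I^*Xh, (X_I^*X_I)^{-1}v\rangle - \langle X_I^*X_{I^c}h_{I^c}, (X_I^*X_I)^{-1}v\rangle \equiv A_1 - A_2. \end{aligned} \tag{3.12}$$

We address each of the two terms individually. First,

$$A_1 \le \|X_I^*Xh\|_{\ell_\infty} \cdot \|(X_I^*X_I)^{-1}v\|_{\ell_1}$$

and

$$\begin{aligned} \|(X_I^*X_I)^{-1}v\|_{\ell_1} &\le \sqrt{S} \cdot \|(X_I^*X_I)^{-1}v\|_{\ell_2} \\ &\le \sqrt{S} \cdot \|(X_I^*X_I)^{-1}\|\,\|v\|_{\ell_2} \\ &\le S \cdot \|(X_I^*X_I)^{-1}\|\,\|v\|_{\ell_\infty}. \end{aligned}$$

Consider the following: (1) $\|X_I^*Xh\|_{\ell_\infty} \le (2 + \sqrt{2})\lambda_p$ by Lemma 3.1 together with the orthogonality condition [see (3.6)], and (2) $\|(X_I^*X_I)^{-1}\| \le 2$ by the invertibility condition. Because of this, we have

$$A_1 \le 2(2 + \sqrt{2})\lambda_p S\|v\|_{\ell_\infty}.$$

However,

$$\|v\|_{\ell_\infty} \le \|X_I^*z\|_{\ell_\infty} + 2\lambda_p \le (2 + \sqrt{2})\lambda_p,$$

so that

$$A_1 \le 2(2 + \sqrt{2})^2\lambda_p^2 \cdot S. \tag{3.13}$$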

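Second, we simply bound the other term $A_2 = \langle h_{I^c}, X_{I^c}^*X_I(X_I^*X_I)^{-1}v\rangle$ by

$$|A_2| \le \|h_{I^c}\|_{\ell_1}\|X_{I^c}^*X_I(X_I^*X_I)^{-1}v\|_{\ell_\infty}$$

with $v = X_I^*z - 2\lambda_p\,\mathrm{sgn}(\beta_I)$. Since

$$\begin{aligned} \|X_{I^c}^*X_I(X_I^*X_I)^{-1}v\|_{\ell_\infty} &\le \|X_{I^c}^*X_I(X_I^*X_I)^{-1}X_I^*z\|_{\ell_\infty} + 2\lambda_p\|X_{I^c}^*X_I(X_I^*X_I)^{-1}\mathrm{sgn}(\beta_I)\|_{\ell_\infty} \\ &\le (2 - \sqrt{2})\lambda_p \end{aligned}$$

because of the complementary size condition, we have

$$|A_2| \le (2 - \sqrt{2})\lambda_p\|h_{I^c}\|_{\ell_1}.$$

To summarize,

$$|\langle h_I, v\rangle| \le 2(2 + \sqrt{2})^2\lambda_p^2 \cdot S + (2 - \sqrt{2})\lambda_p\|h_{I^c}\|_{\ell_1}. \tag{3.14}$$

We conclude by inserting (3.14) into (3.11), which gives

$$\tfrac12\|X(\hat\beta - \beta)\|_{\ell_2}^2 \le 2(2 + \sqrt{2})^2\lambda_p^2 \cdot S,$$

which is what we needed to prove.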

3.3. Norms of random submatrices. In this section, we establish that the 
invertibility and the complementary size conditions hold with large proba- 
bility. These essentially rely on a recent result of Joel Tropp, which we state 
first. 



Theorem 3.2 [27]. Suppose that a set $I$ of predictors is sampled using a Bernoulli model by first creating a sequence $(\delta_j)_{1 \le j \le p}$ of i.i.d. random variables with $\delta_j = 1$ w.p. $S/p$ and $\delta_j = 0$ w.p. $1 - S/p$, and then setting $I = \{j : \delta_j = 1\}$ so that $\mathbb{E}|I| = S$. Then, for $q = 2\log p$,

$$(\mathbb{E}\|X_I^*X_I - \mathrm{Id}\|^q)^{1/q} \le 30\mu(X)\log p + 13\sqrt{\frac{2S\|X\|^2\log p}{p}} \tag{3.15}$$

provided that $S\|X\|^2/p \le 1/4$. In addition, for the same value of $q$,

$$\Bigl(\mathbb{E}\max_{i \in I^c}\|X_I^*X_i\|^q\Bigr)^{1/q} \le 4\mu(X)\sqrt{\log p} + \sqrt{S\|X\|^2/p}. \tag{3.16}$$

The first inequality (3.15) can be derived from the last equation in Section 4 of [27]. To be sure, using the notations of that paper and letting $H = X^*X - \mathrm{Id}$, Tropp shows that

$$\mathbb{E}_q\|RHR\| \le 15\bar{q}\,\mathbb{E}_q\|RHR'\|_{\max} + 12\sqrt{\delta\bar{q}}\,\|HR\|_{1\to 2} + 2\delta\|H\|, \qquad \delta = S/p,$$

where $\bar{q} = \max\{q, 2\log p\}$. Now, consider the following three facts: (1) $\|RHR'\|_{\max} \le \mu(X)$; (2) $\|HR\|_{1\to 2} \le \|X\|$; and (3) $\|H\| \le \|X\|^2$. The first assertion is immediate. The second is justified in [27]. For the third, observe that $\|X^*X - \mathrm{Id}\| \le \max\{\|X\|^2 - 1, 1\}$ (this is an equality when $p > n$), and the claim follows from $\|X\| \ge 1$, which holds, since $X$ has unit-normed columns. With $q = 2\log p$, this gives

$$\mathbb{E}_q\|RHR\| \le 30\mu(X)\log p + 12\sqrt{\frac{2S\|X\|^2\log p}{p}} + \frac{2S\|X\|^2}{p}.$$

Suppose that $S\|X\|^2/p \le 1/4$; then, we can simplify the above inequality and obtain

$$\mathbb{E}_q\|RHR\| \le 30\mu(X)\log p + (12\sqrt{2\log p} + 1)\sqrt{S\|X\|^2/p},$$

which implies (3.15). The second inequality (3.16) is exactly Corollary 5.1 in [27].

The inequalities (3.15) and (3.16) also hold for our slightly different model, in which $I \subset \{1, \ldots, p\}$ is a random subset of predictors with $S$ elements, provided that the right-hand side of both inequalities be multiplied by $2^{1/q}$. This follows from a simple Poissonization argument, which is similar to that used in the proof of Lemma 3.6.

It is now time to investigate how these results imply our conditions, and we first examine how (3.15) implies the invertibility condition. Let $I$ be a random set and put $Z = \|X_I^*X_I - \mathrm{Id}\|$. Clearly, if $Z \le 1/2$, then all the eigenvalues of $X_I^*X_I$ are in the interval $[1/2, 3/2]$ and $\|(X_I^*X_I)^{-1}\| \le 2$. Suppose that $\mu(X)$ and $S$ are sufficiently small, so that the right-hand side of (3.15) is less than 1/4, say. This happens when the coherence $\mu(X)$ and $S$ obey the hypotheses of the theorem. Then, by Markov's inequality, we have that, for $q = 2\log p$,

$$\mathbb{P}(Z > 1/2) \le 2^q\,\mathbb{E}Z^q \le (1/2)^q.$$

In other words, the invertibility condition holds with probability exceeding $1 - p^{-2\log 2}$.
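Both elementary facts used here, that $Z \le 1/2$ confines the spectrum of $X_I^*X_I$ to $[1/2, 3/2]$ and that $(1/2)^{2\log p} = p^{-2\log 2}$, can be seen on a simulated example. The sketch below assumes a Gaussian design with unit-normed columns and arbitrary sizes of our choosing; it is not tied to the hypotheses of Theorem 1.2.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, p, S = 1000, 2000, 5                   # illustrative sizes
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)            # unit-normed columns

I = rng.choice(p, size=S, replace=False)  # random subset of S predictors
G = X[:, I].T @ X[:, I]                   # Gram matrix X_I^* X_I
Z = np.linalg.norm(G - np.eye(S), 2)      # spectral norm of X_I^* X_I - Id
eig = np.linalg.eigvalsh(G)
inv_norm = np.linalg.norm(np.linalg.inv(G), 2)

# Markov step: (1/2)^q with q = 2 log p is the same number as p^(-2 log 2).
q = 2.0 * math.log(p)
markov_bound = 0.5 ** q
```

Here the eigenvalues of `G` lie in $[1 - Z, 1 + Z]$, so whenever `Z` is at most 1/2 the inverse has norm at most 2.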

Recalling that the signs of the nonzero entries of $\beta$ are i.i.d. symmetric variables, we now examine the complementary size condition and begin with a simple lemma.

Lemma 3.3. Let $(W_j)_{j \in J}$ be a fixed collection of vectors in $\ell_2(I)$ and consider the random variable $Z_0$ defined by $Z_0 = \max_{j \in J}|\langle W_j, \mathrm{sgn}(\beta_I)\rangle|$. Then,

$$\mathbb{P}(Z_0 \ge t) \le 2|J| \cdot e^{-t^2/2\kappa^2} \tag{3.17}$$

for any $\kappa$ obeying $\kappa \ge \max_{j \in J}\|W_j\|_{\ell_2}$. Similarly, letting $(W'_j)_{j \in J}$ be a fixed collection of vectors in $\mathbb{R}^n$ and setting $Z_1 = \max_{j \in J}|\langle W'_j, z\rangle|$, we have

$$\mathbb{P}(Z_1 \ge t) \le 2|J| \cdot e^{-t^2/2\kappa^2} \tag{3.18}$$

for any $\kappa$ obeying $\kappa \ge \max_{j \in J}\|W'_j\|_{\ell_2}$.$^9$

$^9$Note that this lemma also holds if the collection of vectors $(W_j)_{j \in J}$ is random, as long as it is independent of $\mathrm{sgn}(\beta_I)$ and $z$.

Proof. The first inequality is an application of Hoeffding's inequality. Indeed, letting $Z_{0,j} = \langle W_j, \mathrm{sgn}(\beta_I)\rangle$, Hoeffding's inequality gives

$$\mathbb{P}(|Z_{0,j}| > t) \le 2e^{-t^2/2\|W_j\|_{\ell_2}^2} \le 2e^{-t^2/2\max_j\|W_j\|_{\ell_2}^2}. \tag{3.19}$$

Inequality (3.17) then follows from the union bound. The second part is even easier, since $Z_{1,j} = \langle W'_j, z\rangle \sim \mathcal{N}(0, \|W'_j\|_{\ell_2}^2)$; thus,

$$\mathbb{P}(|Z_{1,j}| > t) \le 2e^{-t^2/2\|W'_j\|_{\ell_2}^2} \le 2e^{-t^2/2\max_j\|W'_j\|_{\ell_2}^2}. \tag{3.20}$$

Again, the union bound gives (3.18). □
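For a single vector $W$ of small dimension, the Hoeffding bound behind (3.19) can be compared with the exact tail probability by enumerating all sign patterns. This is a small illustrative sketch; the dimension 10, the random vector and the grid of thresholds are arbitrary assumptions.

```python
import itertools
import numpy as np

def exact_sign_tail(w, t):
    """Exact P(|<w, eps>| >= t) for eps uniform on {-1, +1}^len(w)."""
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=len(w))))
    return np.mean(np.abs(signs @ w) >= t)

def hoeffding_bound(w, t):
    """Right-hand side of (3.19) for a single vector w."""
    return 2.0 * np.exp(-t ** 2 / (2.0 * np.dot(w, w)))

rng = np.random.default_rng(0)
w = rng.standard_normal(10)               # 2^10 = 1024 sign patterns
ts = np.linspace(0.1, 4.0 * np.linalg.norm(w), 40)
pairs = [(exact_sign_tail(w, t), hoeffding_bound(w, t)) for t in ts]
```

In every entry of `pairs` the exact probability sits below the bound; applying the same comparison to each $W_j$ and summing over $j \in J$ gives (3.17).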

For each $i \in I^c$, define $Z_{0,i}$ and $Z_{1,i}$ as

$$Z_{0,i} = X_i^*X_I(X_I^*X_I)^{-1}\mathrm{sgn}(\beta_I) \quad\text{and}\quad Z_{1,i} = X_i^*X_I(X_I^*X_I)^{-1}X_I^*z.$$

With these notations, in order to prove the complementary size condition, it is sufficient to show that, with large probability,

$$2\lambda_p Z_0 + Z_1 \le (2 - \sqrt{2})\lambda_p,$$

where $Z_0 = \max_{i \in I^c}|Z_{0,i}|$ and likewise for $Z_1$. Therefore, it is sufficient to prove that, with large probability,

$$Z_0 \le 1/4 \quad\text{and}\quad Z_1 \le (3/2 - \sqrt{2})\lambda_p.$$

The idea is of course to apply Lemma 3.3 together with Theorem 3.2. We have

$$Z_{0,i} = \langle W_i, \mathrm{sgn}(\beta_I)\rangle \quad\text{and}\quad Z_{1,i} = \langle W'_i, z\rangle,$$

where

$$W_i = (X_I^*X_I)^{-1}X_I^*X_i \quad\text{and}\quad W'_i = X_I(X_I^*X_I)^{-1}X_I^*X_i.$$

Recall the definition of $Z$ above and consider the event $E = \{Z \le 1/2\} \cap \{\max_{i \in I^c}\|X_I^*X_i\| \le \gamma\}$ for some positive $\gamma$. On this event, all the singular values of $X_I$ are between $1/\sqrt{2}$ and $\sqrt{3/2}$; thus, $\|(X_I^*X_I)^{-1}\| \le 2$ and $\|X_I(X_I^*X_I)^{-1}\| \le \sqrt{2}$, which gives

$$\|W_i\| \le 2\gamma \quad\text{and}\quad \|W'_i\| \le \sqrt{2}\gamma.$$

Applying (3.17) and (3.18) gives

$$\begin{aligned} \mathbb{P}(\{Z_0 > t\} \cup \{Z_1 > u\}) &\le \mathbb{P}(\{Z_0 > t\} \cup \{Z_1 > u\} \mid E) + \mathbb{P}(E^c) \\ &\le \mathbb{P}(Z_0 > t \mid E) + \mathbb{P}(Z_1 > u \mid E) + \mathbb{P}(E^c) \\ &\le 2pe^{-t^2/8\gamma^2} + 2pe^{-u^2/4\gamma^2} + \mathbb{P}(Z > 1/2) + \mathbb{P}\Bigl(\max_{i \in I^c}\|X_I^*X_i\| > \gamma\Bigr). \end{aligned}$$

We already know that the second to last term of the right-hand side is less than $p^{-2\log 2}$, provided that $\mu(X)$ and $S$ obey the conditions of the theorem. For the other three terms, let $\gamma_0$ be the right-hand side of (3.16). For $t = 1/4$, one can find a constant $c_0$ such that, if $\gamma \le c_0/\sqrt{\log p}$, then $2pe^{-t^2/8\gamma^2} \le 2p^{-2\log 2}$, say. Likewise, for $u = (3/2 - \sqrt{2})\lambda_p$, we may have $2pe^{-u^2/4\gamma^2} \le 2p^{-2\log 2}$. The last term is treated by Markov's inequality, since, for $q = 2\log p$, (3.16) gives

$$\mathbb{P}\Bigl(\max_{i \in I^c}\|X_I^*X_i\| > \gamma\Bigr) \le \gamma^{-q}\,\mathbb{E}\Bigl(\max_{i \in I^c}\|X_I^*X_i\|^q\Bigr) \le (\gamma_0/\gamma)^q.$$

Therefore, if $\gamma_0 \le \gamma/2 = c_0/(2\sqrt{\log p})$, we have that this last term does not exceed $(1/2)^q = p^{-2\log 2}$. For $\mu(X)$ and $S$ obeying the hypotheses of Theorem 1.2, it is indeed the case that $\gamma_0 \le c_0/(2\sqrt{\log p})$. In conclusion, we have shown that all three conditions hold under our hypotheses with probability at least $1 - 6p^{-2\log 2} - p^{-1}(2\pi\log p)^{-1/2}$.

In passing, we would like to remark that proving that $Z_0 \le 1/4$ establishes that the strong irrepresentable condition from [29] holds (with high probability). This condition states that, if $I$ is the support of $\beta$,

$$\|X_{I^c}^*X_I(X_I^*X_I)^{-1}\mathrm{sgn}(\beta_I)\|_{\ell_\infty} \le 1 - \nu,$$

where $\nu$ is any (small) constant greater than zero (this condition is used to show the asymptotic recovery of the support of $\beta$).

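3.4. Proof of Theorem 1.4. The proof of Theorem 1.4 parallels that of Theorem 1.2, and we only sketch it, although we carefully detail the main differences. Let $I_0$ be the support of $\beta_0$. Just as before, all three conditions of Section 3.2, with $I_0$ in place of $I$ and $\beta_0$ in place of $\beta$, hold with overwhelming probability. From now on, we just assume that they are all true.

Since $\hat\beta$ minimizes $K(y, b)$, we have $K(y, \hat\beta) \le K(y, \beta_0)$ or, equivalently,

$$\tfrac12\|y - X\hat\beta\|_{\ell_2}^2 + 2\lambda_p\|\hat\beta\|_{\ell_1} \le \tfrac12\|y - X\beta_0\|_{\ell_2}^2 + 2\lambda_p\|\beta_0\|_{\ell_1}. \tag{3.21}$$

Expand $\|y - X\hat\beta\|_{\ell_2}^2$ as

$$\|y - X\hat\beta\|_{\ell_2}^2 = \|z - (X\hat\beta - X\beta)\|_{\ell_2}^2 = \|z\|_{\ell_2}^2 - 2\langle z, X\hat\beta - X\beta\rangle + \|X\hat\beta - X\beta\|_{\ell_2}^2$$

and $\|y - X\beta_0\|_{\ell_2}^2$ in the same way. Then, plug these identities into (3.21) to obtain

$$\tfrac12\|X\hat\beta - X\beta\|_{\ell_2}^2 \le \tfrac12\|X\beta_0 - X\beta\|_{\ell_2}^2 + \langle z, X\hat\beta - X\beta_0\rangle + 2\lambda_p(\|\beta_0\|_{\ell_1} - \|\hat\beta\|_{\ell_1}). \tag{3.22}$$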
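Put $h = \hat\beta - \beta_0$. We follow the same steps as in Section 3.2 to arrive at

$$\tfrac12\|X\hat\beta - X\beta\|_{\ell_2}^2 \le \tfrac12\|X\beta_0 - X\beta\|_{\ell_2}^2 + \langle h_{I_0}, v\rangle - (2 - \sqrt{2})\lambda_p\|h_{I_0^c}\|_{\ell_1},$$

where $v = X_{I_0}^*z - 2\lambda_p\,\mathrm{sgn}(\beta_{I_0})$. Just as before,

$$\langle h_{I_0}, v\rangle = \langle X_{I_0}^*Xh, (X_{I_0}^*X_{I_0})^{-1}v\rangle - \langle h_{I_0^c}, X_{I_0^c}^*X_{I_0}(X_{I_0}^*X_{I_0})^{-1}v\rangle \equiv A_1 - A_2.$$

By assumption, $|A_2| \le (2 - \sqrt{2})\lambda_p \cdot \|h_{I_0^c}\|_{\ell_1}$. The difference is now in $A_1$, since we can no longer claim that $\|X^*Xh\|_{\ell_\infty} \le (2 + \sqrt{2})\lambda_p$. Decompose $A_1$ as

$$A_1 = \langle X_{I_0}^*X(\hat\beta - \beta), (X_{I_0}^*X_{I_0})^{-1}v\rangle + \langle X_{I_0}^*X(\beta - \beta_0), (X_{I_0}^*X_{I_0})^{-1}v\rangle \equiv A_1^0 + A_1^1.$$

Because $\|X^*X(\hat\beta - \beta)\|_{\ell_\infty} \le (2 + \sqrt{2})\lambda_p$, one can use the same argument as before to obtain

$$A_1^0 \le 2(2 + \sqrt{2})^2\lambda_p^2 S.$$

We now look at the other term. Since we assume $\|X_{I_0}(X_{I_0}^*X_{I_0})^{-1}\| \le \sqrt{2}$, we have

$$\begin{aligned} |A_1^1| &= |\langle X(\beta - \beta_0), X_{I_0}(X_{I_0}^*X_{I_0})^{-1}v\rangle| \\ &\le \|X(\beta - \beta_0)\|_{\ell_2}\,\|X_{I_0}(X_{I_0}^*X_{I_0})^{-1}v\|_{\ell_2} \\ &\le \sqrt{2}\,\|X(\beta - \beta_0)\|_{\ell_2}\,\|v\|_{\ell_2}. \end{aligned}$$

Using $ab \le (a^2 + b^2)/2$ and $\|v\|_{\ell_2}^2 \le (2 + \sqrt{2})^2\lambda_p^2 S$ gives

$$|A_1^1| \le \frac{\sqrt{2}}{2}\|X(\beta - \beta_0)\|_{\ell_2}^2 + \frac{\sqrt{2}}{2}(2 + \sqrt{2})^2\lambda_p^2 S.$$

To summarize,

$$\langle h_{I_0}, v\rangle \le \frac{\sqrt{2}}{2}\|X(\beta - \beta_0)\|_{\ell_2}^2 + \Bigl(2 + \frac{\sqrt{2}}{2}\Bigr)(2 + \sqrt{2})^2\lambda_p^2 S + (2 - \sqrt{2})\lambda_p \cdot \|h_{I_0^c}\|_{\ell_1}.$$

It follows that

$$\tfrac12\|X\hat\beta - X\beta\|_{\ell_2}^2 \le \frac{1 + \sqrt{2}}{2}\|X\beta_0 - X\beta\|_{\ell_2}^2 + (4 + \sqrt{2})(1 + \sqrt{2})^2\lambda_p^2 S.$$

This concludes the proof.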

We close this section by arguing about (1.18) and (1.19). First, it follows 
from our proof that (1.18) holds. Second, our analysis also shows that the 
set ^o,s is very large and obeys (1.19). 

3.5. Proof of Theorem 1.3. Just as with our other claims, we begin by stating a few assumptions that hold with very large probability, and then we show that, under these conditions, the conclusions of the theorem hold. These assumptions are as follows:

(i) The matrix $X_I^*X_I$ is invertible and obeys $\|(X_I^*X_I)^{-1}\| \le 2$;

(ii) $\|X_{I^c}^*X_I(X_I^*X_I)^{-1}\mathrm{sgn}(\beta_I)\|_{\ell_\infty} \le \frac14$;

(iii) $\|(X_I^*X_I)^{-1}X_I^*z\|_{\ell_\infty} \le 2\lambda_p$;

(iv) $\|X_{I^c}^*(I - P[I])z\|_{\ell_\infty} \le \sqrt{2}\lambda_p$;

(v) The matrix-vector product $(X_I^*X_I)^{-1}\mathrm{sgn}(\beta_I)$ obeys

$$\|(X_I^*X_I)^{-1}\mathrm{sgn}(\beta_I)\|_{\ell_\infty} \le 3. \tag{3.23}$$

We already know that conditions (i) and (ii) hold with large probability [see Section 3.3; the change from 1/2 to 1/4 in (ii) is unessential]. As before, we let $E$ be the event $\{\|X_I^*X_I - \mathrm{Id}\| \le 1/2\}$. For (iii), the idea is the same, and we express $\|(X_I^*X_I)^{-1}X_I^*z\|_{\ell_\infty}$ as $\max_{i \in I}|\langle W_i, z\rangle|$, where $W_i$ is now the $i$th row of $(X_I^*X_I)^{-1}X_I^*$. On $E$, $\max_i\|W_i\| \le \|(X_I^*X_I)^{-1}X_I^*\| \le \sqrt{2}$, and the claim now follows from (3.5). Indeed, one can check that, conditional on $E$,

$$\mathbb{P}(\|(X_I^*X_I)^{-1}X_I^*z\|_{\ell_\infty} > 2\lambda_p) \le |I| \cdot p^{-2} \cdot (2\pi\log p)^{-1/2}.$$

For (iv), we write $\|X_{I^c}^*(I - P[I])z\|_{\ell_\infty}$ as $\max_{i \in I^c}|\langle W_i, z\rangle|$, where $W_i = (I - P[I])X_i$. We have $\|W_i\| \le \|X_i\| = 1$ and, conditional on $E$, it follows, from (3.5), that

$$\mathbb{P}(\|X_{I^c}^*(I - P[I])z\|_{\ell_\infty} > \sqrt{2}\lambda_p) \le p^{-1} \cdot (2\pi\log p)^{-1/2}.$$

The subtle estimate is (v), and it is proven in the next section. There, we show that (3.23) holds with probability at least $1 - 2p^{-2\log 2} - 2|I|p^{-2}$. Hence, under the assumptions of Theorem 1.3, (i)-(v) hold with probability at least $1 - 2p^{-1}((2\pi\log p)^{-1/2} + |I|/p) - O(p^{-2\log 2})$.

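Lemma 3.4. Suppose that the assumptions (i)-(v) hold, and assume that $\min_{i \in I}|\beta_i|$ obeys the condition of Theorem 1.3. Then, the lasso solution is given by $\hat\beta = \beta + h$ with

$$\begin{aligned} h_I &= (X_I^*X_I)^{-1}[X_I^*z - 2\lambda_p\,\mathrm{sgn}(\beta_I)], \\ h_{I^c} &= 0. \end{aligned} \tag{3.24}$$

Proof. The point $\hat\beta$ is the unique solution to the lasso functional if

$$\begin{aligned} X_i^*(y - X\hat\beta) &= 2\lambda_p\,\mathrm{sgn}(\hat\beta_i), & \hat\beta_i &\neq 0, \\ |X_i^*(y - X\hat\beta)| &< 2\lambda_p, & \hat\beta_i &= 0, \end{aligned} \tag{3.25}$$

and the columns of $X_T$ are linearly independent, where $T$ is the support of $\hat\beta$. Consider, then, $h$ as in (3.24), and observe that

$$\|h_I\|_{\ell_\infty} \le \|(X_I^*X_I)^{-1}X_I^*z\|_{\ell_\infty} + 2\lambda_p\|(X_I^*X_I)^{-1}\mathrm{sgn}(\beta_I)\|_{\ell_\infty} \le 2\lambda_p + 6\lambda_p.$$

It follows that $\|h_I\|_{\ell_\infty} < \min_{i \in I}|\beta_i|$ and, therefore, $\hat\beta = \beta + h$ obeys

$$\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta), \qquad \mathrm{sgn}(\hat\beta_I) = \mathrm{sgn}(\beta_I).$$

We now check that $\hat\beta = \beta + h$ obeys (3.25). By definition, we have

$$y - X\hat\beta = z - Xh = z - X_I(X_I^*X_I)^{-1}[X_I^*z - 2\lambda_p\,\mathrm{sgn}(\hat\beta_I)],$$

since $\beta$ and $\hat\beta$ share the same support and the same signs. Clearly,

$$X_I^*(y - X\hat\beta) = 2\lambda_p\,\mathrm{sgn}(\hat\beta_I),$$

which is the first half of (3.25). For the second half, let $P[I] = X_I(X_I^*X_I)^{-1}X_I^*$ be the orthogonal projection onto the span of $X_I$. Then,

$$\begin{aligned} \|X_{I^c}^*(y - X\hat\beta)\|_{\ell_\infty} &= \|X_{I^c}^*(I - P[I])z + 2\lambda_p X_{I^c}^*X_I(X_I^*X_I)^{-1}\mathrm{sgn}(\hat\beta_I)\|_{\ell_\infty} \\ &\le \|X_{I^c}^*(I - P[I])z\|_{\ell_\infty} + 2\lambda_p\|X_{I^c}^*X_I(X_I^*X_I)^{-1}\mathrm{sgn}(\hat\beta_I)\|_{\ell_\infty} \\ &\le \sqrt{2}\lambda_p + \tfrac12\lambda_p \\ &< 2\lambda_p. \end{aligned}$$

Finally, note that $X_T^*X_T$ is indeed invertible, since $T = I$; this is just our invertibility condition. This concludes the proof. □

Lemma 3.4 proves that $\hat\beta$ has the same support as $\beta$ and the same signs as $\beta$, which is of course the content of Theorem 1.3.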
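The closed form (3.24) can be tried out numerically by building the candidate $\hat\beta = \beta + h$ and testing the optimality conditions (3.25) directly. The sketch below is an illustration only: the Gaussian design, the sizes, the support and the amplitude 30 are our assumptions, chosen so that the conditions hold with a wide margin.

```python
import numpy as np

def lemma34_candidate(X, beta, z):
    """Candidate beta + h from (3.24), with lambda_p = sqrt(2 log p)."""
    p = X.shape[1]
    lam_p = np.sqrt(2.0 * np.log(p))
    I = np.flatnonzero(beta)
    XI = X[:, I]
    h = np.zeros(p)
    h[I] = np.linalg.solve(XI.T @ XI, XI.T @ z - 2.0 * lam_p * np.sign(beta[I]))
    return beta + h, lam_p

def kkt_gap(X, y, b, tau):
    """Violation of the equalities in (3.25) and largest off-support correlation."""
    corr = X.T @ (y - X @ b)
    on = b != 0
    eq_gap = np.max(np.abs(corr[on] - tau * np.sign(b[on])))
    off_max = np.max(np.abs(corr[~on]))
    return eq_gap, off_max

rng = np.random.default_rng(0)
n, p = 1000, 50
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)            # unit-normed columns
beta = np.zeros(p)
beta[[3, 17, 41]] = [30.0, -30.0, 30.0]   # amplitudes well above 8 * lambda_p
z = rng.standard_normal(n)                # sigma = 1
y = X @ beta + z

beta_hat, lam_p = lemma34_candidate(X, beta, z)
eq_gap, off_max = kkt_gap(X, y, beta_hat, 2.0 * lam_p)
```

When `eq_gap` is zero up to rounding and `off_max` is strictly below $2\lambda_p$, the candidate is the unique lasso solution, and its support and signs coincide with those of $\beta$.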

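3.6. Proof of (3.23). We need to show that $\|(X_I^*X_I)^{-1}\mathrm{sgn}(\beta_I)\|_{\ell_\infty} \le 3$ with high probability. To begin, we write

$$\begin{aligned} \|(X_I^*X_I)^{-1}\mathrm{sgn}(\beta_I)\|_{\ell_\infty} &\le \|\mathrm{sgn}(\beta_I)\|_{\ell_\infty} + \|((X_I^*X_I)^{-1} - \mathrm{Id})\,\mathrm{sgn}(\beta_I)\|_{\ell_\infty} \\ &\le 1 + \max_{i \in I}|\langle W_i, \mathrm{sgn}(\beta_I)\rangle|, \end{aligned}$$

where $W_i$ is the $i$th row of $(X_I^*X_I)^{-1} - \mathrm{Id}$ (or column since this is a symmetric matrix).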

Lemma 3.5. Let $W_i$ be the $i$th row of $(X_I^*X_I)^{-1} - \mathrm{Id}$. Under the hypotheses of Theorem 1.3, we have

$$\mathbb{P}\Bigl(\max_{i \in I}\|W_i\| \ge (\log p)^{-1/2}\Bigr) \le 2p^{-2\log 2}.$$

Proof. Set $A = \mathrm{Id} - X_I^*X_I$. On the event $E = \{\|\mathrm{Id} - X_I^*X_I\| \le 1/2\}$ (which holds w.p. at least $1 - p^{-2\log 2}$), we have

$$(X_I^*X_I)^{-1} = I + A + A^2 + \cdots.$$

Therefore, since $W_i = ((X_I^*X_I)^{-1} - \mathrm{Id})e_i$, where $e_i$ is the vector whose $i$th component is 1 and the others 0, $W_i = Ae_i + A^2e_i + \cdots$ and

$$\begin{aligned} \|W_i\| &\le \|Ae_i\| + \|A\|\|Ae_i\| + \|A\|^2\|Ae_i\| + \cdots \\ &\le \|Ae_i\|\sum_{k=0}^{\infty}\|A\|^k \\ &\le \|Ae_i\|/(1 - \|A\|). \end{aligned}$$

Hence, on $E$, $\|W_i\| \le 2\|Ae_i\|$.

For each $i \in I$, $Ae_i$ is the $i$th row or column of $\mathrm{Id} - X_I^*X_I$ and, for each $j \in I$, its $j$th component is equal to $-\langle X_i, X_j\rangle$ if $j \neq i$, and 0 for $j = i$ since $\|X_i\| = 1$. Thus,

$$\|Ae_i\|^2 = \sum_{j \in I : j \neq i}|\langle X_i, X_j\rangle|^2.$$

Now, it follows from Lemma 3.6 that

$$\sum_{j \in I : j \neq i}|\langle X_i, X_j\rangle|^2 \le S\|X\|^2/p + t$$

with probability at least $1 - 2e^{-t^2/[2\mu^2(X)(S\|X\|^2/p + t/3)]}$. Under the assumptions of Theorem 1.3, we have $S\|X\|^2/p \le c_0(\log p)^{-1} \le (8\log p)^{-1}$ provided that $c_0 \le 1/8$. With $t = (8\log p)^{-1}$, this gives

$$\sum_{j \in I : j \neq i}|\langle X_i, X_j\rangle|^2 \le 1/(4\log p) \tag{3.26}$$

with probability at least $1 - 2e^{-3/(64\mu^2(X)\log p)}$. Now, the assumption about the coherence guarantees that $\mu(X) \le A_0/\log p$, so that (3.26) holds with probability at least $1 - 2e^{-3\log p/(64A_0^2)}$. Hence, by choosing $A_0$ sufficiently small, the lemma follows from the union bound. □
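The Neumann series step and the resulting bound $\|W_i\| \le \|Ae_i\|/(1 - \|A\|)$ can be checked on a simulated Gram matrix. The following sketch assumes a Gaussian design with unit-normed columns and sizes of our choosing; the truncation at 60 terms is likewise an arbitrary choice that is more than enough when $\|A\| \le 1/2$.

```python
import numpy as np

def neumann_inverse(G, terms=60):
    """Approximate G^{-1} by I + A + A^2 + ... with A = I - G (requires ||A|| < 1)."""
    S = G.shape[0]
    A = np.eye(S) - G
    total = np.eye(S)
    power = np.eye(S)
    for _ in range(terms):
        power = power @ A                 # A^k
        total += power
    return total, A

rng = np.random.default_rng(0)
n, S = 1000, 8
XI = rng.standard_normal((n, S))
XI /= np.linalg.norm(XI, axis=0)          # unit-normed columns, so diag(G) = 1
G = XI.T @ XI

approx_inv, A = neumann_inverse(G)
normA = np.linalg.norm(A, 2)
W = np.linalg.inv(G) - np.eye(S)          # column i is W_i
w_norms = np.linalg.norm(W, axis=0)
a_norms = np.linalg.norm(A, axis=0)       # column i is A e_i
```

On this example `approx_inv` agrees with the exact inverse, and each entry of `w_norms` is at most the corresponding entry of `a_norms` divided by $1 - \|A\|$, hence at most twice it when $\|A\| \le 1/2$.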



Lemma 3.6. Suppose that $I \subset \{1, \ldots, p\}$ is a random subset of predictors with at most $S$ elements. For each $i$, $1 \le i \le p$, we have

$$\mathbb{P}\Bigl(\sum_{j \in I : j \neq i}|\langle X_i, X_j\rangle|^2 > \frac{S}{p}\|X\|^2 + t\Bigr) \le 2\exp\Bigl(-\frac{t^2}{2\mu^2(X)(S\|X\|^2/p + t/3)}\Bigr). \tag{3.27}$$

Proof. The inequality (3.27) is essentially an application of Bernstein's inequality, which states that, for a sum of uniformly bounded independent random variables with $|Y_k - \mathbb{E}Y_k| < c$,

$$\mathbb{P}\Bigl(\sum_{k=1}^{n}(Y_k - \mathbb{E}Y_k) > t\Bigr) \le e^{-t^2/(2\sigma^2 + 2ct/3)}, \tag{3.28}$$

where $\sigma^2$ is the sum of the variances, $\sigma^2 = \sum_{k=1}^{n}\mathrm{Var}(Y_k)$. The issue here is that $\sum_{j \in I : j \neq i}|\langle X_i, X_j\rangle|^2$ is not a sum of independent variables, and we need to use a kind of Poissonization argument to reduce this to a sum of independent terms.

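A set $I'$ of predictors is sampled using a Bernoulli model by first creating the sequence

$$\delta_j = \begin{cases} 1, & \text{w.p. } S/p, \\ 0, & \text{w.p. } 1 - S/p, \end{cases}$$

and then setting $I' = \{j \in \{1, \ldots, p\} : \delta_j = 1\}$. The size of the set $I'$ follows a binomial distribution, and $\mathbb{E}|I'| = S$. We make two claims: first, for each $t > 0$, we have

$$\mathbb{P}\Bigl(\sum_{j \in I : j \neq i}|\langle X_i, X_j\rangle|^2 > t\Bigr) \le 2\,\mathbb{P}\Bigl(\sum_{j \in I' : j \neq i}|\langle X_i, X_j\rangle|^2 > t\Bigr); \tag{3.29}$$

second, for each $t > 0$,

$$\mathbb{P}\Bigl(\sum_{j \in I' : j \neq i}|\langle X_i, X_j\rangle|^2 > \frac{S}{p}\|X\|^2 + t\Bigr) \le \exp\Bigl(-\frac{t^2}{2\mu^2(X)(S\|X\|^2/p + t/3)}\Bigr). \tag{3.30}$$

Clearly, (3.29) and (3.30) give (3.27).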
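To justify the first claim, observe that

$$\begin{aligned} \mathbb{P}\Bigl(\sum_{j \in I' : j \neq i}|\langle X_i, X_j\rangle|^2 > t\Bigr) &= \sum_{k=0}^{p}\mathbb{P}\Bigl(\sum_{j \in I' : j \neq i}|\langle X_i, X_j\rangle|^2 > t \,\Big|\, |I'| = k\Bigr)\mathbb{P}(|I'| = k) \\ &= \sum_{k=0}^{p}\mathbb{P}\Bigl(\sum_{j \in I_k : j \neq i}|\langle X_i, X_j\rangle|^2 > t\Bigr)\mathbb{P}(|I'| = k), \end{aligned}$$

where $I_k$ is selected uniformly at random with $|I_k| = k$. We make two observations: (1) since $S$ is an integer, it is the median of $|I'|$ and $\mathbb{P}(|I'| \ge S) \ge 1/2$; and (2) $\mathbb{P}(\sum_{j \in I_k : j \neq i}|\langle X_i, X_j\rangle|^2 > t)$ is a nondecreasing function of $k$. To see why this is true, consider that a subset $I_{k+1}$ of size $k + 1$ can be sampled by first choosing a subset $I_k$ of size $k$ uniformly and then choosing the remaining entry uniformly at random from the complement of $I_k$. It follows that, with $Z_k = \sum_{j \in I_k : j \neq i}|\langle X_i, X_j\rangle|^2$, we have that $Z_{k+1}$ and $Z_k + Y_k$, where $Y_k$ is a nonnegative random variable, have the same distribution. Hence, $\mathbb{P}(Z_{k+1} \ge t) \ge \mathbb{P}(Z_k \ge t)$. With these two observations in mind, we continue

$$\begin{aligned} \mathbb{P}\Bigl(\sum_{j \in I' : j \neq i}|\langle X_i, X_j\rangle|^2 > t\Bigr) &\ge \mathbb{P}\Bigl(\sum_{j \in I_S : j \neq i}|\langle X_i, X_j\rangle|^2 > t\Bigr)\sum_{k=S}^{p}\mathbb{P}(|I'| = k) \\ &\ge \frac12\,\mathbb{P}\Bigl(\sum_{j \in I : j \neq i}|\langle X_i, X_j\rangle|^2 > t\Bigr), \end{aligned}$$

which is the first claim (3.29).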

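For the second claim (3.30), observe that

$$\sum_{j \in I' : j \neq i}|\langle X_i, X_j\rangle|^2 = \sum_{1 \le j \le p : j \neq i}\delta_j|\langle X_i, X_j\rangle|^2 \equiv \sum_{1 \le j \le p : j \neq i}Y_j.$$

The $Y_j$ are independent and obey:

1. $|Y_j - \mathbb{E}Y_j| \le \sup_{j \neq i}|\langle X_i, X_j\rangle|^2 \le \mu^2(X)$.

2. The sum of means is bounded by

$$\sum_{1 \le j \le p : j \neq i}\mathbb{E}Y_j = \frac{S}{p}\sum_{1 \le j \le p : j \neq i}|\langle X_i, X_j\rangle|^2 \le \frac{S\|X\|^2}{p}.$$

The last inequality follows from $\sum_{1 \le j \le p : j \neq i}|\langle X_i, X_j\rangle|^2 \le \sum_{1 \le j \le p}|\langle X_i, X_j\rangle|^2$, where the right-hand side is equal to $\|X^*X_i\|^2 \le \|X\|^2\|X_i\|^2 = \|X\|^2$, since the columns are unit-normed.

3. The sum of variances is bounded by

$$\sum_{1 \le j \le p : j \neq i}\mathrm{Var}(Y_j) = \frac{S}{p}\Bigl(1 - \frac{S}{p}\Bigr)\sum_{1 \le j \le p : j \neq i}|\langle X_i, X_j\rangle|^4 \le \frac{S\mu^2(X)\|X\|^2}{p}.$$

The last inequality follows from $\sum_{1 \le j \le p : j \neq i}|\langle X_i, X_j\rangle|^4 \le \mu^2(X)\sum_{1 \le j \le p}|\langle X_i, X_j\rangle|^2$, which is less than or equal to $\mu^2(X)\|X\|^2$ as before.

The claim (3.30) is now a simple application of Bernstein's inequality (3.28). □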

Lemma 3.5 establishes that (3.23) holds with probability at least $1 - 2p^{-2\log 2} - 2|I|p^{-2}$. Indeed, on the event $\{\max_i\|W_i\| \le (\log p)^{-1/2}\}$, it follows from Lemma 3.3 that

$$\mathbb{P}\Bigl(\max_{i \in I}|\langle W_i, \mathrm{sgn}(\beta_I)\rangle| \ge 2\Bigr) \le 2|I|e^{-2\log p} = 2|I|p^{-2}.$$

4. Discussion. 

4.1. Connection with other works. In the last few years, there have been many beautiful works that attempt to understand the properties of the lasso and other minimum $\ell_1$ algorithms, such as the Dantzig selector, when the number of variables may be larger than the sample size [3, 5, 6, 10, 13, 15, 16, 20, 21, 29, 30]. Some papers focus on the estimation of the parameter $\beta$ and on recovering its support; others focus on estimating $X\beta$. These are quite distinct problems, especially when $p > n$; consider, for instance, the noiseless case.

In [5, 6, 13], it is required that the level of sparsity $S$ be smaller than $1/\mu(X)$. For instance, [5] develops an oracle inequality that requires $S \le 1/(32\mu(X))$. Even when $\mu(X)$ is minimal [i.e., of size about $1/\sqrt{n}$, as in the case where $X$ is the time-frequency dictionary, or about $\sqrt{(2\log p)/n}$, as for Gaussian matrices and many other kinds of random matrices], one sees that the sparsity level must be considerably smaller than $\sqrt{n}$. When the coherence is of the order of $(\log p)^{-1}$ (as we have allowed in our paper), one would need a sparsity level of order $\log p$. Having a sparsity level substantially smaller than the inverse of the coherence is a common assumption in the modern literature on the subject, although, in some circumstances, a few papers have developed some weaker assumptions. To be a little more specific, [29] reports an asymptotic result in which the lasso recovers the exact support of $\beta$ provided that the strong irrepresentable condition of Section 3.3 holds. The references [20, 28] develop very similar results and use very similar requirements. The recent paper [17] develops similar results but requires either a good initial estimator or a level of coherence on the order of $n^{-1/2}$. In [10, 21], the singular values of $X$ restricted to any subset of size proportional to the sparsity of $\beta$ must be bounded away from zero, while [3] introduces an extension of this condition. In nearly all these works, a sufficient condition is that the sparsity be much smaller than the inverse of the coherence.

4.2. Our contribution. It follows from the previous discussion that there is a disconnect between the available literature and what practical experience shows. For instance, the lasso is known to work very well empirically when the sparsity far exceeds the inverse of the coherence $1/\mu(X)$ [13], even though the proofs assume that the sparsity is less than a fraction of $1/\mu(X)$. In that paper, the coherence is $1/\sqrt{n}$ so that, as mentioned earlier, results are available only when the sparsity is much smaller than $\sqrt{n}$, which does not explain what series of computer experiments reveal.

Our work bridges this gap. We do so by considering the performance of the lasso one expects in almost all cases but not all. By considering statistical ensembles much as in [9], one shows that, in the above examples, the lasso works provided that the sparsity level is bounded by about $n/\log p$; that is, for generic signals, the sparsity can grow almost linearly with the sample size. We also prove that, under these conditions, the "Irrepresentable Condition" holds with high probability, and we show that, as long as the entries of $\beta$ are not too small, one can recover the exact support of $\beta$ with high probability.

Finally, there does not seem to be much room for improvement, as all of our conditions appear necessary as well. In Section 2, we have proposed special examples in which the lasso performs poorly. On the one hand, these examples show that, even with highly incoherent matrices, one cannot expect good performance in all cases unless the sparsity level is very small. On the other hand, one cannot really eliminate our assumption about the coherence, since we have shown that, with coherent matrices, the lasso would fail to work well on generically sparse objects.

One could of course consider other statistical descriptions of sparse $\beta$'s and/or ideal models, and we leave this issue open for further research.